Abstract

We  show  the  interconvertibility  of  context-free-language  reachability  problems  and  a  class  of  set- 
constraint  problems:  given  a  context-free-language  reachability  problem,  we  show  how  to  construct  a 
set-constraint  problem  whose  answer  gives  a  solution  to  the  reachability  problem;  given  a  set-constraint 
problem,  we  show  how  to  construct  a  context-free-language  reachability  problem  whose  answer  gives  a 
solution to the set-constraint problem. The interconvertibility of these two formalisms offers a conceptual
advantage akin to the advantage gained from the interconvertibility of finite-state automata and
regular expressions in formal language theory, namely, a problem can be formulated in whichever formalism
is most natural. It also offers some insight into the “$O(n^3)$ bottleneck” for different types of
program-analysis  problems  and  allows  results  previously  obtained  for  context-free-language  reachability 
problems  to  be  applied  to  set-constraint  problems  and  vice  versa. 

Key  Words:  definite  set  constraints,  context-free-language  reachability,  path  problem,  program  analysis, 
complexity  of  program-analysis  problems. 


1  Introduction 

This paper concerns algorithms for converting between two techniques for formalizing program-analysis problems:
context-free-language reachability and a class of set constraints. Context-free-language reachability
(CFL-reachability)  is  a  generalization  of  ordinary  graph  reachability  (i.e.,  transitive  closure).  It  has  been 
used  for  a  number  of  program-analysis  applications,  including  interprocedural  slicing  [23,  25],  interprocedural 
dataflow  analysis  [24],  and  shape  analysis  [37]. 

Set  constraints  have  been  applied  to  program  analysis  by  using  them  to  collect  (a  superset  of)  the  set 
of  values  that  the  program’s  variables  may  hold  during  execution.  Typically,  a  set  variable  is  created  for 
each  program  variable  at  each  program  point.  Set  constraints  are  then  generated  that  approximate  the 
program’s  behavior.  Program  analysis  then  becomes  a  problem  of  finding  the  least  solution  of  the  set- 
constraint  problem.  Set  constraints  have  been  used  for  program  analysis,  including  [2,  17,  19,  29,  43],  and 
type  inference,  including  [3,  4]. 

*This work was supported in part by the National Science Foundation under grants CCR-9100424 and CCR-9625667, and by
the  Defense  Advanced  Research  Projects  Agency  (monitored  by  the  Office  of  Naval  Research  under  contracts  N00014-92-J-1937 
and  N00014-97-1-0114)  and  in  part  by  a  grant  from  IBM. 
Numerous classes of set constraints have been identified and studied. Except for Section 5, the class of
set  constraints  considered  in  this  paper  is  a  subclass  of  what  have  been  called  definite  set  constraints  [18]; 
throughout  the  paper,  the  term  “set  constraints”  refers  to  the  class  of  set  constraints  defined  in  Section  2.2. 

The  principal  contribution  of  this  paper  is  to  relate  these  two  formalisms: 

•  We  give  a  construction  for  converting  a  CFL-reachability  problem  into  a  set-constraint  problem.  This 
construction can be carried out in $O(n + e)$ time, where n is the number of nodes in the graph, and e
is  the  number  of  edges  in  the  graph. 

•  We  give  a  second  construction  for  converting  a  set-constraint  problem  into  a  CFL-reachability  problem. 
Again  the  construction  can  be  carried  out  in  time  linear  in  the  size  of  the  set-constraint  problem. 

We  gain  several  benefits  from  knowing  that  these  two  program-analysis  formalisms  are  interconvertible: 

•  There  is  an  advantage  from  the  conceptual  standpoint:  When  confronted  with  a  program-analysis 
problem,  one  can  think  and  reason  in  terms  of  whichever  paradigm  is  most  appropriate.  (This  is 
analogous  to  the  situation  one  has  in  formal  language  theory  with  finite-state  automata  and  regular 
expressions,  or  with  pushdown  automata  and  context-free  grammars.)  For  example,  CFL-reachability 
leads  to  natural  formulations  of  interprocedural  dataflow  analysis  [25]  and  interprocedural  slicing  [40, 
23].  Set-constraints  lead  to  natural  formulations  of  shape  analysis  [30,  43].  Each  of  these  problems  could 
be  formulated  using  the  (respective)  opposite  formalisms — our  interconvertibility  result  formulates  this 
idea  precisely — but  it  would  be  awkward. 

• These constructions also offer some insight into the “$O(n^3)$ bottleneck” for program-analysis problems.
That is, a number of program-analysis problems are known to be solvable in time $O(n^3)$, but no sub-cubic-time
algorithm is known. This is sometimes (erroneously) attributed to the need to perform
transitive closure when a problem is solved. However, because transitive closure can be performed in
sub-cubic time [13], this is not the correct explanation. We have long believed that, in many cases, the
real source of the $O(n^3)$ bottleneck is that a CFL-reachability problem needs to be solved. This paper
shows this to be the case for the class of definite set-constraint problems.$^1$

•  CFL-reachability  is  known  to  be  log-space  complete  for  polynomial  time  (or  PTIME-complete)  [1, 
38,  48].  Because  the  CFL-reachability  to  set-constraint  construction  can  be  performed  in  log-space, 
this paper demonstrates that a class of set-constraint problems is also PTIME-complete. Because
PTIME-complete  problems  are  believed  not  to  be  efficiently  parallelizable  (i.e.,  cannot  be  solved  in 
polylog  time  on  a  polynomial  number  of  processors),  this  paper  extends  the  class  of  program-analysis 
problems  that  are  unlikely  to  have  efficient  parallel  algorithms. 

•  A  demand  algorithm  computes  a  partial  solution  to  a  problem,  when  only  part  of  the  full  answer  is 
needed.  For  example,  a  demand  algorithm  might  be  used  to  compute  the  results  of  a  program  analysis 
only  for  points  in  the  innermost  loops  of  a  given  program.  Because  CFL-reachability  problems  can 
be  solved  in  a  demand-driven  fashion  (e.g.,  see  [37,  36]),  this  paper  shows  that  (in  principle)  set- 
constraint  problems  can  also  be  solved  in  a  demand-driven  fashion.  To  our  knowledge,  this  has  not 
been  investigated  before  in  the  literature  on  set  constraints. 

•  CFL-reachability  lends  itself  to  analysis  of  languages  with  a  lazy  semantics  [37].  Set  constraints  with 
strict  semantics  are  more  readily  used  to  analyze  languages  with  a  strict  semantics.  However,  our 
interconvertibility  results  show  that  CFL-reachability  can  be  used  to  analyze  strict  languages,  and  set 
constraints  with  strict  semantics  can  be  used  to  analyze  lazy  languages. 

$^1$The source of the $O(n^3)$ bottleneck has also been attributed to the need to solve a dynamic transitive-closure problem.
The  basis  for  this  statement  is  that  several  cubic-time  algorithms  for  solving  program-analysis  problems  maintain  the  transitive 
closure  of  a  relation  in  an  on-line  fashion  (i.e.,  as  a  sequence  of  insertions  into  the  relation  is  performed).  At  the  present  time, 
no  sub-cubic-time  algorithm  is  known  for  this  version  of  the  dynamic  transitive-closure  problem. 

In  a  CFL-reachability  problem,  new  base  facts  (in  the  form  of  graph  edges  or  grammar  productions)  are  not  added  to  the 
problem  in  an  on-line  fashion.  (When  dynamic  programming  is  used  to  solve  CFL-reachability  problems,  additional  edges  are 
inserted  in  the  graph;  however,  in  this  case,  the  edges  are  added  by  the  algorithm  and  not  inserted  by  an  outside  agent.)  Thus, 
we  feel  that  the  statement  “a  CFL-reachability  problem  needs  to  be  solved”  offers  a  declarative  characterization  of  the  source 
of the $O(n^3)$ bottleneck.
Notation                              Meaning

A ::= B C                             A production of a context-free grammar
$A(i, j)$                             An edge labelled A from node i to node j
$c(V_1, \ldots, V_r)$                 An atomic expression of arity r used in set constraints
$X \supseteq c(V_1, \ldots, V_r)$     A set constraint
$X \Rightarrow a$                     A production of a regular term grammar

Table 1: Notation used throughout this paper.

A  different  class  of  set  constraints  has  been  used  by  Heintze  to  formulate  analysis  problems  for  a  higher- 
order  language  (ML)  [17].  In  Section  5,  we  show  how  set-constraint  problems  of  this  class  can  be  converted 
to  CFL-reachability  problems  while  preserving  cubic-time  solvability  (i.e.,  cubic  in  the  size  of  the  original 
problem).  A  notable  aspect  of  this  result  is  that  it  demonstrates  that  the  CFL-reachability  framework 
is  capable  of  expressing  analysis  problems,  such  as  program  slicing  and  shape  analysis,  for  higher-order 
languages.  All  previous  applications  of  CFL-reachability  to  program  analysis  have  been  limited  to  first-order 
languages. 

For all three constructions there is a thorny issue that we must address: When we plug the various
parameters that characterize the size of the transformed problems into the standard formulas for the
worst-case asymptotic running time in which the transformed problems can be solved, it appears that each of our
constructions causes a blowup in the time required to solve the problem. That is, from the standpoint of
worst-case asymptotic running time, it appears that we do worse by performing the transformation and solving the
transformed problem. If this were true, it would not be a satisfactory demonstration of interconvertibility.
In  Sections  3.5,  4.4,  and  5.3  we  examine  this  issue  and  show  that,  in  fact,  the  asymptotic  running  time  of 
the  constructed  problems  is  the  same  as  the  problems  they  were  constructed  from. 

We  assume  that  the  reader  is  familiar  with  context-free  grammars.  In  Section  2,  we  define  CFL- 
reachability  and  a  class  of  set-constraint  problems,  and  describe  dynamic-programming  algorithms  that 
can  be  used  to  solve  them.  Section  2  also  defines  regular  term  grammars,  which  are  used  to  give  finite 
presentations  of  solutions  to  set-constraint  problems.  In  Section  3,  we  show  how  to  express  CFL-reachability 
using  set  constraints  and  discuss  the  running  time  of  the  dynamic-programming  algorithm  on  the  resulting 
problem.  In  Section  4,  we  discuss  how  to  restate  set-constraint  problems  as  CFL-reachability  problems  and 
again  examine  the  running  time  of  the  dynamic-programming  algorithm.  In  Section  5,  we  show  how  to 
encode  a  second  class  of  set-constraint  problems  as  CFL-reachability  problems.  In  Section  6  we  show  how  to 
express  CFL-reachability  problems  with  this  second  class  of  set-constraints.  Section  7  offers  some  concluding 
remarks. 


2  Background 

To  understand  the  interconvertibility  result,  it  is  necessary  to  have  a  grasp  of  the  problem  domains  that  we 
are  working  with  and  the  algorithms  for  solving  these  types  of  problems.  Table  1  summarizes  some  of  the 
notational  conventions  we  will  use  in  the  paper. 

2.1  CFL-Reachability 

In  this  section,  we  define  CFL-reachability  and  describe  a  dynamic-programming  algorithm  for  solving  the 
all-pairs  CFL-reachability  problem. 

Definition 2.1 Let CF be a context-free grammar over an alphabet of terminal symbols $T$ and non-terminal
symbols $N$. (Unless explicitly noted, we will follow the convention of starting terminal symbols with lowercase
letters, and starting non-terminal symbols with uppercase letters.) Let $G$ be a directed graph whose edges
are labeled with members of $\Sigma = T \cup N$. Each path in $G$ defines a word over $\Sigma$, namely, the word obtained by
concatenating, in order, the labels of the edges on the path. A path in $G$ is an S-path if its word is derived
from the start symbol $S$ of grammar CF. We define four varieties of context-free-language reachability
problems (CFL-reachability problems), as follows:
• The all-pairs S-path problem is to determine all pairs of nodes $n_1$ and $n_2$ such that there exists an
S-path in G from $n_1$ to $n_2$.

• The single-source S-path problem is to determine all nodes $n_2$ such that there exists an S-path in G
from a given source node $n_1$ to $n_2$.

• The single-target S-path problem is to determine all nodes $n_1$ such that there exists an S-path in G
from $n_1$ to a given target node $n_2$.

• The single-source/single-target S-path problem is to determine whether there exists an S-path in G
from a given source node $n_1$ to a given target node $n_2$.


2.1.1 Solving CFL-Reachability Problems

We  now  give  a  dynamic-programming  algorithm  for  solving  all-pairs  CFL-reachability  problems.  We  are 
given  a  graph  G  whose  edges  are  labelled  with  terminal  symbols  from  a  context-free  grammar.  To  find  the 
S-paths  in  this  graph,  we  go  through  a  process  of  “filling  in”  the  graph  with  new  edges,  which  are  labelled 
with  non-terminal  symbols.  A  new  edge  labelled  A  from  node  i  to  node  j  indicates  that  there  is  an  A-path 
from node i to node j. (As indicated in Table 1, we use the notation $A(i, j)$ to represent an edge labelled A
from node i to node j.) When this process is completed, there will be an edge labelled S between any two
nodes connected by an S-path. This idea is formalized in the following algorithm:

Algorithm  2.1  (CFL-Reachability  Algorithm) 

1.  Normalize  the  grammar:  In  order  for  this  process  to  work  efficiently,  we  first  convert  the  grammar 
to a normal form in which the right-hand side of each production has at most two symbols from $T \cup N$.$^2$
This  can  be  done  by  introducing  new  non-terminal  symbols.  Thus,  a  production  such  as 

A  ::=  a  B  C  d 

might  be  converted  into  these  productions: 

A ::= A' A''

A' ::= a B
A'' ::= C d

This  transformation  can  be  done  in  time  linear  in  the  size  of  the  grammar  and  causes  a  linear  blowup 
in  the  size  of  the  grammar.  When  the  grammar  is  in  normal  form,  each  production  will  have  one  of  the 
forms A ::= M N, B ::= P, or C ::= $\epsilon$, where A, B, and C are nonterminals, M, N, and P are terminals
or nonterminals, and $\epsilon$ represents the empty string.

2.  Create  the  initial  worklist:  Let  W  be  a  worklist  of  edges.  Initialize  W  with  all  of  the  edges  in  the 
original  graph. 

3. Add edges for $\epsilon$-productions: The production A ::= $\epsilon$ indicates that there is a length-0 A-path
from each node i to itself (see Figure 1(a)). Hence:

for each production of the form A ::= $\epsilon$ do
    for each node i in the graph do
        if the edge $A(i, i)$ is not in G then add $A(i, i)$ to G and to W fi
    od
od


$^2$The normal form used is similar to Chomsky Normal Form, except that epsilon productions are allowed, and there are no
restrictions  on  where  terminal  symbols  may  appear. 
[Figure 1 panels: (a) for A ::= $\epsilon$; (b) for A ::= B; (c) for A ::= B C; (d) for A ::= C B]

Figure 1: Edge induction in the CFL-Reachability Algorithm (Algorithm 2.1). The figures show how a
production of the context-free grammar causes the algorithm to add, or induce, an edge in the graph
(dashed lines show induced edges).

4. Add edges for other productions: To determine where to add other edges to the graph, the current
edges must be examined.

while W is not empty do
    Select and remove an edge $B(i, j)$ from W

    /* Step 4.1: look for productions of the form A ::= B (see Figure 1(b)). */
    for each production of the form A ::= B do
        if the edge $A(i, j)$ is not in G then add $A(i, j)$ to G and to W fi
    od

    /* Step 4.2: look for productions of the form A ::= B C. For each such production, for each */
    /* edge $C(j, k)$, add $A(i, k)$ (see Figure 1(c)). */
    for each production of the form A ::= B C do
        for each outgoing edge $C(j, k)$ from node j do
            if the edge $A(i, k)$ is not in G then add $A(i, k)$ to G and to W fi
        od
    od

    /* Step 4.3: look for productions of the form A ::= C B. For each such production, for each */
    /* edge $C(k, i)$, add $A(k, j)$ (see Figure 1(d)). */
    for each production of the form A ::= C B do
        for each incoming edge $C(k, i)$ into node i do
            if the edge $A(k, j)$ is not in G then add $A(k, j)$ to G and to W fi
        od
    od
od

5. Return the set $\{ (i, j) \mid S(i, j) \in G \}$.

□ 
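The worklist procedure of Algorithm 2.1 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function name `cfl_reachability`, the triple representation `(label, i, j)` for edges, and the pair representation `(lhs, rhs)` for productions are assumptions made for the example, and the grammar is assumed to be already in the normal form of step 1.

```python
from collections import defaultdict, deque

def cfl_reachability(nodes, edges, productions, start="S"):
    """All-pairs S-path problem for a normalized grammar (hypothetical helper).

    nodes: iterable of hashable node ids
    edges: iterable of (label, i, j) triples
    productions: iterable of (lhs, rhs), rhs a tuple of 0, 1, or 2 symbols
    """
    nodes = list(nodes)
    graph = set()                   # all edges found so far
    out_edges = defaultdict(set)    # node -> {(label, target)}
    in_edges = defaultdict(set)     # node -> {(label, source)}
    worklist = deque()

    def add(label, i, j):
        # Add an edge to G and to W unless it is already in G.
        if (label, i, j) not in graph:
            graph.add((label, i, j))
            out_edges[i].add((label, j))
            in_edges[j].add((label, i))
            worklist.append((label, i, j))

    unary = defaultdict(list)   # B -> [A]       for A ::= B
    first = defaultdict(list)   # B -> [(A, C)]  for A ::= B C
    second = defaultdict(list)  # B -> [(A, C)]  for A ::= C B

    # Step 2: the worklist starts with the edges of the original graph.
    for label, i, j in edges:
        add(label, i, j)

    for lhs, rhs in productions:
        if len(rhs) == 0:
            # Step 3: an epsilon production induces a self-loop at every node.
            for i in nodes:
                add(lhs, i, i)
        elif len(rhs) == 1:
            unary[rhs[0]].append(lhs)
        else:
            first[rhs[0]].append((lhs, rhs[1]))
            second[rhs[1]].append((lhs, rhs[0]))

    # Step 4: process edges until no new edge can be induced.
    while worklist:
        b, i, j = worklist.popleft()
        for a in unary[b]:                          # step 4.1
            add(a, i, j)
        for a, c in first[b]:                       # step 4.2
            for label, k in list(out_edges[j]):
                if label == c:
                    add(a, i, k)
        for a, c in second[b]:                      # step 4.3
            for label, k in list(in_edges[i]):
                if label == c:
                    add(a, k, j)

    # Step 5: report the pairs joined by an edge labelled with the start symbol.
    return {(i, j) for (label, i, j) in graph if label == start}
```

For instance, with the normalized matched-parenthesis grammar S ::= ε, S ::= open T, T ::= S close, S ::= S S and a path graph labelled open, open, close, close, the sketch reports an S-path between the endpoints of each balanced substring as well as the length-0 S-path at every node.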

Note  that  the  other  varieties  of  CFL-reachability  problems — single-source,  single-target,  and  single- 
source/single-target  problems — can  be  solved  by  solving  the  corresponding  all-pairs  problem.  [24]  describes 
a demand version of the CFL-Reachability Algorithm (tailored for a certain matched-parenthesis reachability
problem) that is usually more efficient for the single-source, single-target, and single-source/single-target
CFL-reachability problems. (In some cases, the demand algorithm in [24] performs the same amount of work
as the CFL-Reachability Algorithm given here.) Section 7.6 gives a more detailed discussion of demand
algorithms.
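The reduction of the other three varieties to the all-pairs problem amounts to filtering the all-pairs answer. A minimal sketch, assuming the all-pairs answer is available as a Python set of `(source, target)` pairs (the function names are hypothetical, and this is not the demand algorithm of [24]):

```python
def single_source(all_pairs, source):
    """Nodes reachable from `source` along an S-path."""
    return {j for (i, j) in all_pairs if i == source}

def single_target(all_pairs, target):
    """Nodes from which `target` is reachable along an S-path."""
    return {i for (i, j) in all_pairs if j == target}

def single_source_single_target(all_pairs, source, target):
    """Whether there is an S-path from `source` to `target`."""
    return (source, target) in all_pairs
```

This always does the full all-pairs work first, which is exactly the cost a demand algorithm tries to avoid.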

We now show that the running time of the CFL-Reachability Algorithm is bounded by $O(|\Sigma|^3 n^3)$, where
$\Sigma$ is the set of terminals and nonterminals in the normalized grammar, and n is the number of nodes in the
graph. The running time is dominated by the amount of work performed in steps 4.2 and 4.3. In these steps,
each edge added to the graph is potentially paired with each of its neighboring edges. This is equivalent to
saying that each pair of neighboring edges is considered; that is, for each node j, each incoming edge $A(i, j)$
is potentially paired with each outgoing edge $B(j, k)$.

[Figure 2 annotations: $O(n)$ possible source nodes; $O(|\Sigma|)$ possible edges from each source.]

Figure 2: In a graph from a CFL-reachability problem, the number of edges into any given node is bounded
by $O(|\Sigma| n)$, where $\Sigma$ is the alphabet of the grammar, and n is the number of nodes in the graph.

For any given node j, the number of incoming edges is bounded by $|\Sigma| n$ (see Figure 2). Similarly, the
number of outgoing edges from j is bounded by $|\Sigma| n$. This means that the total number of edge pairings
that j ever participates in is bounded by $|\Sigma|^2 n^2$. For any given edge pair $B(i, j)$ and $C(j, k)$, the number
of productions that may have “B C” as the body of the production is bounded by $|\Sigma|$. Node j is one of
n nodes; consequently the total amount of work performed during any run of the algorithm is bounded by
$O(|\Sigma|^3 n^3)$.

For a fixed grammar, $|\Sigma|$ is constant, and therefore an all-pairs CFL-reachability problem can be solved
in time $O(n^3)$ (where the constant of proportionality is cubic in $|\Sigma|$).
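The counting argument can be collected into a single product. The factor labels below are only a restatement of the preceding paragraph, not an additional result:

```latex
\begin{align*}
\text{work}
  &\le \underbrace{n}_{\text{choices of } j}
   \cdot \underbrace{|\Sigma|\, n}_{\text{incoming edges of } j}
   \cdot \underbrace{|\Sigma|\, n}_{\text{outgoing edges of } j}
   \cdot \underbrace{|\Sigma|}_{\text{productions with body } B\,C} \\
  &= |\Sigma|^3 n^3 .
\end{align*}
```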

2.2  Set  Constraints 

In  this  section,  we  define  a  class  of  set  constraints.  The  material  in  this  section  is  a  summary  of  work  done 
by  Heintze  and  Jaffar  [16,  17,  18]. 

2.2.1  Set  Expressions  and  Set  Constraints 

In  the  class  of  set  constraints  we  deal  with,  a  set  expression  is  either  a  set  variable  (denoted  by  V,  W,  X, 
etc.)  or  has  one  of  the  following  forms: 

• $c(V_1, \ldots, V_r)$. An expression of this form is called an atomic expression, and c is called a constructor
or  a  function  symbol.  When  set  constraints  are  used  for  program  analysis,  atomic  expressions  are 
typically  used  to  model  data  constructors  of  the  language  being  analyzed  (e.g.,  cons).  All  constructors 
have  a  fixed  arity  greater  than  or  equal  to  zero.  We  will  follow  the  convention  of  abbreviating  nullary 
constructors  as  c,  rather  than  writing  c(). 

• $c_i^{-1}(V)$. An expression of this form is called a projection. Projections can be used to model selection
operators  (such  as  car  and  cdr).  The  subscript  of  a  projection  indicates  which  field  of  the  corresponding 
constructor  is  selected. 
In the class of problems we consider, all set constraints are of the form $V \supseteq sexp$, where sexp is a set
expression.

The  following  example  should  clarify  how  set  constraints  can  be  used  for  program  analysis: 

Example  2.2  Suppose  a  program  contains  the  following  bindings: 
x  =  cons(y,z)  w  =  cdr(x) 

This would generate the constraints $X \supseteq cons(Y, Z)$ and $W \supseteq cons_2^{-1}(X)$. In the second constraint, the
projection $cons_2^{-1}(X)$ models cdr, asking for the second element of each cons value in X. □

2.2.2  Solutions  to  Set  Constraints 

A ground term over a set of constructors is either a nullary constructor or has the form $c(v_1, \ldots, v_r)$ where
$v_1, \ldots, v_r$ are ground terms. Thus, given the nullary constructor a and the unary constructor succ, examples
of ground terms include a, succ(a), succ(succ(a)), etc.

A solution to a collection of set constraints is a mapping from set variables to sets of ground terms of
constructors such that the constraints are satisfied. If we have a mapping $\mathcal{I}$ from set variables to sets of
ground terms, then the mapping can be extended to map set expressions to sets of values:

• $\mathcal{I}(c(V_1, \ldots, V_r)) = \{ c(v_1, \ldots, v_r) \mid v_1 \in \mathcal{I}(V_1), \ldots, v_r \in \mathcal{I}(V_r) \}$

• $\mathcal{I}(c_i^{-1}(V)) = \{ v_i \mid c(v_1, \ldots, v_r) \in \mathcal{I}(V) \}$

(Note that this definition of $\mathcal{I}$ is strict with regard to the arguments of constructors; the expression
$c(V_1, \ldots, V_r)$ is mapped to a nonempty value if and only if $V_1, \ldots, V_r$ are all mapped to nonempty values.)
$\mathcal{I}$ is said to satisfy a constraint $X \supseteq sexp$ if $\mathcal{I}(X) \supseteq \mathcal{I}(sexp)$. $\mathcal{I}$ is said to be a solution to a collection
of constraints if $\mathcal{I}$ satisfies each of the constraints.
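For mappings whose sets are finite, the extension of the mapping to set expressions can be computed directly. The sketch below is an illustration under assumed encodings that do not come from the paper: a ground term is a tuple `(c, v1, ..., vr)`, a set variable is a string, an atomic expression is `("con", c, (V1, ..., Vr))`, and a projection is `("proj", c, i, V)` with a 1-based field index.

```python
from itertools import product

def evaluate(sexp, interp):
    """Extend interp (variable -> finite set of ground terms) to a set expression."""
    if isinstance(sexp, str):                      # a set variable
        return interp.get(sexp, set())
    if sexp[0] == "con":                           # c(V1, ..., Vr)
        _, c, args = sexp
        choices = [interp.get(v, set()) for v in args]
        # Strict: if any argument set is empty, the product is empty.
        return {(c, *vs) for vs in product(*choices)}
    if sexp[0] == "proj":                          # i-th field of c-terms in V
        _, c, i, v = sexp
        return {t[i] for t in interp.get(v, set())
                if t[0] == c and len(t) > i}
    raise ValueError("unknown set expression")

def satisfies(interp, var, sexp):
    """Check the constraint var ⊇ sexp under interp."""
    return interp.get(var, set()) >= evaluate(sexp, interp)
```

With Y mapped to {a}, Z mapped to {nil}, and X mapped to {cons(a, nil)}, evaluating the projection that models cdr on X yields {nil}; if Y is mapped to the empty set instead, cons(Y, Z) evaluates to the empty set, which is the strictness noted above.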

An  issue  of  how  to  represent  a  solution  to  a  collection  of  set  constraints  arises  because  a  solution  may 
consist  of  an  infinite  set.  Furthermore,  a  collection  of  set  constraints  may  have  multiple  solutions. 

Example  2.3  Consider  the  following  constraints: 

$X \supseteq a$        $X \supseteq succ(X)$

One  solution  to  these  constraints  maps  X  to  the  infinite  set  {a,  succ(a),  succ(succ(a)), . . .}.  Another  solution 
maps  X  to  the  infinite  set  {cons(a,  a),  succ(cons(a,  a)),  ...  , a,  succ(a),  succ(succ(a)), . . .}.  □ 

We  will  always  be  interested  in  least  solutions  (under  the  subset  ordering),  e.g.,  the  first  of  the  two  solutions 
listed  in  the  above  example.  Heintze  formalizes  this  idea  in  [16].  Note  that  a  collection  of  set  constraints 
must  always  have  a  solution.  In  particular,  the  map  that  sends  each  variable  to  the  set  of  all  ground  terms 
is  a  trivial  solution  to  any  collection  of  constraints.  (This  is  not  generally  true  of  all  classes  of  constraints; 
it  holds  here  because  our  constraints  never  restrict  a  ground  term  from  appearing  in  the  solution  set  of  any 
variable.) 

The  solution  to  a  collection  of  set  constraints  can  be  written  as  a  regular  term  grammar  [14],  which  is  a 
formalism  that  allows  certain  infinite  sets  of  terms  to  be  represented  in  a  finite  manner.  There  are  standard 
algorithms  for  dealing  with  regular  term  grammars  (e.g.,  for  determining  membership)  [14]. 

A regular term grammar consists of a finite, non-empty set of non-terminals, a set of function symbols, and
a finite set of productions. Each function symbol has a fixed arity. Productions are of the form $N \Rightarrow term$
where N is a non-terminal. A term is a non-terminal or of the form $c(term_1, \ldots, term_r)$, where c is a
function symbol of arity r. As with other grammars, a derivability relation is defined. Given a production
$N \Rightarrow term$, $term_1$ derives $term_2$ (denoted by $term_1 \Rightarrow term_2$) if $term_2$ is obtained from $term_1$ by replacing
an occurrence of N in $term_1$ with term. The reflexive, transitive closure $\Rightarrow^*$ is defined as usual.

The  regular  term  grammar  that  describes  the  least  solution  of  Example  2.3  above  has  these  productions: 

$X \Rightarrow a$        $X \Rightarrow succ(X)$
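To see how such a grammar finitely presents an infinite set, one can enumerate the ground terms it derives up to a bounded nesting depth. The sketch below assumes a restricted grammar shape in which every production has the form N ⇒ c(N_1, ..., N_r) with non-terminal arguments (the shape obtained from explicit-form constraints); the dictionary encoding and the function name are hypothetical.

```python
from itertools import product

def terms_up_to_depth(grammar, symbol, depth):
    """Ground terms derivable from `symbol` with nesting depth at most `depth`.

    grammar: non-terminal -> list of (constructor, tuple of non-terminals)
    Ground terms are encoded as tuples (constructor, subterm1, ..., subtermr).
    """
    if depth == 0:
        return set()
    result = set()
    for c, args in grammar.get(symbol, []):
        # Each argument position ranges over the shallower terms of its non-terminal.
        choices = [terms_up_to_depth(grammar, n, depth - 1) for n in args]
        result |= {(c, *vs) for vs in product(*choices)}
    return result

# The grammar for the least solution of Example 2.3.
nat = {"X": [("a", ()), ("succ", ("X",))]}
```

Calling `terms_up_to_depth(nat, "X", 3)` yields the encodings of a, succ(a), and succ(succ(a)); raising the depth bound adds one more term each time.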
2.2.3 Solving Set Constraints

The reader may notice that in Example 2.3 the set constraints $X \supseteq a$ and $X \supseteq succ(X)$ look very similar
to the productions $X \Rightarrow a$ and $X \Rightarrow succ(X)$ of the regular term grammar specifying the solution.
Such constraints are said to be in explicit form [16]: A constraint is in explicit form if it is of the form
$V \supseteq c(V_1, \ldots, V_r)$. A collection of constraints in explicit form is converted to a regular term grammar by
taking the variables to be non-terminals and converting each $\supseteq$ into $\Rightarrow$.

For  any  collection  of  constraints  C,  we  say  that  a  variable  X  is  ground  if  the  least  solution  to  the 
constraints  of  C  that  are  in  explicit  form  does  not  map  X  to  the  empty  set  (i.e.,  X  is  mapped  to  some 
ground term in the least solution). We say that $c(V_1, \ldots, V_r)$ is ground if $V_1, \ldots, V_r$ are all ground.

The algorithm for solving set constraints involves augmenting the collection of set constraints with
constraints in explicit form until no more can be added:

Algorithm 2.2 (SC-Reduction Algorithm) Given a collection of set constraints C, the following steps are
repeated until neither step causes C to change:

1. If $X \supseteq c_i^{-1}(Y)$ and $Y \supseteq c(V_1, \ldots, V_r)$ both appear in C and the expression $c(V_1, \ldots, V_r)$ is ground,
then add the constraint $X \supseteq V_i$ to C, if it is not already there.

2. If $X \supseteq Y$ and $Y \supseteq c(V_1, \ldots, V_r)$ both appear in C, and $c(V_1, \ldots, V_r)$ is ground, then add the constraint
$X \supseteq c(V_1, \ldots, V_r)$ to C, if it is not already there.

When  no  more  constraints  can  be  added,  the  constraints  in  explicit  form  are  converted  to  a  regular  term 
grammar;  this  describes  the  least  solution  [16].  □ 
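The two reduction steps can be sketched as a naive fixpoint computation. This is an illustrative sketch only: it recomputes groundness on every pass, makes no attempt to meet any particular running-time bound, and uses assumed encodings in which a constraint is a pair `(variable, sexp)`, a set variable is a string, an atomic expression is `("con", c, (V1, ..., Vr))`, and a projection is `("proj", c, i, V)` with a 1-based index.

```python
def is_atom(sexp):
    """True for an atomic expression ("con", c, (V1, ..., Vr))."""
    return not isinstance(sexp, str) and sexp[0] == "con"

def ground_variables(constraints):
    """Variables that are nonempty in the least solution of the explicit-form constraints."""
    ground = set()
    changed = True
    while changed:
        changed = False
        for x, sexp in constraints:
            if (x not in ground and is_atom(sexp)
                    and all(v in ground for v in sexp[2])):
                ground.add(x)
                changed = True
    return ground

def sc_reduce(constraints):
    """Apply steps 1 and 2 until nothing changes; return the explicit-form constraints."""
    cs = set(constraints)
    while True:
        ground = ground_variables(cs)
        # Constraints Y ⊇ c(V1, ..., Vr) whose atomic expression is ground.
        ground_atoms = [(y, s) for y, s in cs
                        if is_atom(s) and all(v in ground for v in s[2])]
        new = set()
        for x, sexp in cs:
            for y, atom in ground_atoms:
                if sexp == y:
                    # Step 2: X ⊇ Y and Y ⊇ c(...) give X ⊇ c(...).
                    new.add((x, atom))
                elif (not isinstance(sexp, str) and sexp[0] == "proj"
                      and sexp[3] == y and sexp[1] == atom[1]):
                    # Step 1: X ⊇ c_i^{-1}(Y) and Y ⊇ c(...) give X ⊇ V_i.
                    new.add((x, atom[2][sexp[2] - 1]))
        if new <= cs:
            return {(x, s) for x, s in cs if is_atom(s)}
        cs |= new
```

Applied to the constraints of Example 2.2 together with the assumed base facts Y ⊇ a and Z ⊇ nil, the sketch first derives W ⊇ Z by step 1 and then W ⊇ nil by step 2. Without Y ⊇ a, the expression cons(Y, Z) is not ground and nothing is derived for W.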

The SC-Reduction Algorithm does not generate new atomic expressions; this means that when the
algorithm finishes, for a fixed variable Y, the number of constraints of the form $Y \supseteq c(V_1, V_2, \ldots, V_r)$ in C
is bounded by $O(k)$, where k is the number of atomic expressions used in C. The total number of constraints
in C of the form $Y \supseteq c(V_1, V_2, \ldots, V_r)$ is bounded by $O(vk)$, where v is the number of set variables used
in C. Thus, the total number of times the first reduction step is ever applied is bounded by $O(pkv)$, where
p is the maximum number of projection constraints that can match with a given constraint of the form
$Y \supseteq c(V_1, V_2, \ldots, V_r)$.

The total number of constraints in C of the form $Y \supseteq X$ is bounded by $O(v^2)$. Thus, the total number
of times the second step is applied is bounded by $O(v^2 k)$. Let t be the total number of constraints in the
original problem. In the worst case, v, k, and p are all $O(t)$, and the total number of steps is
bounded by $O(t^3)$.

The SC-Reduction Algorithm can be made to run in time $O(t^3)$ by using a worklist and a mark on each
variable to track groundness information:

1. Let W be a worklist of constraints. Initialize W to $\{ X \supseteq a \in C \mid a \text{ is a nullary constructor} \}$.

2. Mark all set variables as having the property “not ground.”

3. Perform the reduction steps:

while W is not empty do
    Select and remove a constraint $X \supseteq sexp$ from W
    if $X \supseteq sexp$ is of the form $X \supseteq c(V_1, V_2, \ldots, V_r)$ then
        for each constraint of the form $Y \supseteq c_i^{-1}(X)$ in C do
            if $Y \supseteq V_i$ is not in C then Insert $Y \supseteq V_i$ into C and W fi
        od
        for each constraint of the form $Y \supseteq X$ in C do
            if $Y \supseteq c(V_1, V_2, \ldots, V_r)$ is not in C then Insert $Y \supseteq c(V_1, V_2, \ldots, V_r)$ into C and W fi
        od
    else if $X \supseteq sexp$ is of the form $X \supseteq Y$ then
        for each constraint of the form $Y \supseteq c(V_1, V_2, \ldots, V_r)$ in C such that $V_1, \ldots, V_r$ are all ground do
            if $X \supseteq c(V_1, V_2, \ldots, V_r)$ is not in C then Insert $X \supseteq c(V_1, V_2, \ldots, V_r)$ into C and W fi
        od
    fi

    if X is not marked as ground then
        mark X as ground
        for each constraint of the form $Y \supseteq c(\ldots X \ldots)$ in the original collection of constraints do
            if all set variables used in $c(\ldots X \ldots)$ are ground then
                Insert $Y \supseteq c(\ldots X \ldots)$ into W
            fi
        od
        for each constraint of the form $Y \supseteq X$ in the original collection of constraints do
            Insert $Y \supseteq X$ into W
        od
    fi
od


To make this run in time $O(t^3)$, constant-time access is needed to certain subsets of C in different parts of
the algorithm; this can be achieved with a constant amount of overhead if the proper data structures (e.g.,
matrices) are maintained for storing the subsets. The number of constraints of the form $X \supseteq c(V_1, V_2, \ldots, V_r)$
that may appear on the worklist is bounded by $O(kv)$; the number of reductions performed on a given
constraint of this form is bounded by $O(p + v)$. The number of constraints of the form $X \supseteq Y$ that may
appear on the worklist is bounded by $O(v^2)$; the number of reductions performed on a given constraint of
this form is bounded by $O(k)$.

For each constraint X ⊇ sexp that appears on the worklist, a check is performed to see if X is marked
ground; these checks may contribute O(kv + v^2) steps to the total running time. When X is first marked as
ground, an attempt is made to propagate the new groundness information to all of the original constraints
that use X in their right-hand side; note that groundness information need not be propagated to generated
constraints because generated constraints can only be created if their right-hand sides are ground. The
total number of attempts to propagate groundness information to an original constraint of the form
Y ⊇ c(V1, V2, ..., Vr) is bounded by r. The total number of attempts to propagate groundness information to an
original constraint of the form Y ⊇ X is 1. Since r is constant, the total amount of work done to propagate
groundness information is bounded by O(t).

Thus, the entire algorithm runs in time O(pvk + kv^2 + t), which in the worst case is O(t^3).
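The worklist loop can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the tuple encodings `("con", X, c, args)`, `("var", X, Y)`, and `("proj", X, c, i, Y)` are hypothetical, the worklist W is assumed to start with the constraints whose right-hand side is a nullary constructor, and C is assumed to start as the original collection. The sketch scans C linearly, so it does not attempt to meet the O(t^3) bound, which relies on the indexed data structures mentioned above.

```python
from collections import deque

def sc_reduce(original):
    """Naive worklist reduction over hypothetical tuple-encoded constraints.

    ("con", X, c, args)   encodes X ⊇ c(args...)
    ("var", X, Y)         encodes X ⊇ Y
    ("proj", X, c, i, Y)  encodes X ⊇ c_i^{-1}(Y), with i counted from 1
    """
    C = set(original)
    ground = set()
    # Assumption: only nullary-constructor constraints are trivially ground at the start.
    W = deque(k for k in original if k[0] == "con" and len(k[3]) == 0)

    def add(k):
        if k not in C:
            C.add(k)
            W.append(k)

    while W:
        k = W.popleft()
        X = k[1]
        if k[0] == "con":
            _, _, c, args = k
            for q in list(C):
                if q[0] == "proj" and q[2] == c and q[4] == X:
                    add(("var", q[1], args[q[3] - 1]))      # Y ⊇ Vi
                elif q[0] == "var" and q[2] == X:
                    add(("con", q[1], c, args))             # Y ⊇ c(V1..Vr)
        elif k[0] == "var":
            Y = k[2]
            for q in list(C):
                if q[0] == "con" and q[1] == Y and all(v in ground for v in q[3]):
                    add(("con", X, q[2], q[3]))             # X ⊇ c(V1..Vr)
        if X not in ground:
            ground.add(X)
            # Propagate groundness to original constraints that use X on the right.
            for q in original:
                if q[0] == "con" and X in q[3] and all(v in ground for v in q[3]):
                    W.append(q)
                elif q[0] == "var" and q[2] == X:
                    W.append(q)
    return C, ground

# A small input: V1 ⊇ a, V2 ⊇ V1, V3 ⊇ cons(V1, V2), V4 ⊇ cons_2^{-1}(V3)
constraints = [
    ("con", "V1", "a", ()),
    ("var", "V2", "V1"),
    ("con", "V3", "cons", ("V1", "V2")),
    ("proj", "V4", "cons", 2, "V3"),
]
reduced, ground_vars = sc_reduce(constraints)
```

On this input the sketch derives V2 ⊇ a, V4 ⊇ V2, and V4 ⊇ a, and marks all four variables as ground.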


3  Transforming  CFL-Reachability  Problems  into  Set-Constraint 
Problems 

We  now  turn  to  the  method  for  expressing  a  CFL-reachability  problem  as  a  set-constraint  problem.  We  first 
address  how  to  encode  the  graph  using  set  constraints.  We  then  address  how  to  encode  the  productions  of  the 
context-free  grammar.  Finally,  we  examine  the  time  needed  to  solve  the  resulting  collection  of  constraints. 

3.1  Encoding  the  Graph 

The  construction  is  based  on  the  idea  of  representing  each  node  i  with  one  variable  Xi  and  one  nullary 
constructor  nodei.  These  are  linked  by  constraints  of  the  form 

Xi ⊇ nodei, for i = 1, ..., n

In  essence,  nodei  serves  as  a  label  identifying  the  node  to  which  Xi  belongs. 

We  now  need  a  way  to  associate  a  node  with  a  set  of  edges  to  other  nodes.  (As  in  Section  2.1.1,  “edges” 
also  means  the  A-edges  that  may  be  added  to  a  graph  to  represent  A-paths.)  In  the  final  solution,  an  edge 
from  node  i  to  node  j  labelled  A  (where  A  is  a  terminal  or  nonterminal)  is  represented  by  the  fact  that 
the  term  A(nodej)  is  in  the  solution  set  for  variable  Xi.  In  accordance  with  this  goal,  we  use  constraints 
involving  Xi  to  indicate  the  set  of  targets  of  outgoing  edges  from  node  i,  using  unary  constructors  to  encode 
the labels of edges. The argument to a constructor c is the target of an encoded c-edge. For example, if the
initial graph contains an edge from node i to node j with label a, then the initial collection of constraints
includes

Xi ⊇ a(Xj)

The set of constraints constructed in this manner completely encodes the initial graph.

Figure 3: Use of Dst[A,i] and Rchd[B^{-1},i] to encode the production A ::= B C. The variable Rchd[B^{-1},i]
represents the set of nodes reached by following B-edges from i. The variable Dst[A,i] represents the set of
nodes to which there should be an A-edge from node i.
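As a rough illustration of the graph encoding, the following Python sketch emits one constraint Xi ⊇ nodei per node and one constraint Xi ⊇ a(Xj) per labelled edge. The function name `encode_graph` and the tuple encoding `("con", X, c, args)` for X ⊇ c(args) are assumptions made for this sketch only.

```python
def encode_graph(n, edges):
    """Encode a graph with nodes 1..n and labelled edges (i, label, j).

    Each constraint is a hypothetical tuple ("con", X, c, args) meaning X ⊇ c(args...).
    """
    # Xi ⊇ nodei: the nullary constructor nodei labels the node that Xi belongs to.
    constraints = [("con", f"X{i}", f"node{i}", ()) for i in range(1, n + 1)]
    # Xi ⊇ a(Xj): an a-edge from node i to node j.
    constraints += [("con", f"X{i}", a, (f"X{j}",)) for (i, a, j) in edges]
    return constraints

graph_constraints = encode_graph(2, [(1, "a", 2)])
```

The result has n + e constraints, matching the counts used later in the running-time analysis.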

3.2  Encoding  the  Productions 

To  encode  the  productions,  we  first  convert  the  context-free  grammar  to  the  normal  form  discussed  in 
Section  2.1.1.  Thus,  we  assume  that  the  right-hand  side  of  each  production  has  no  more  than  two  symbols. 
Now  consider  a  production  of  the  form  A  ::=  B  C,  where  A  is  a  nonterminal,  and  B  and  C  are  either 
nonterminals  or  terminals.  This  production  indicates  that  there  is  an  A-path  from  node  i  to  node  k  when 
there exists a node j such that there is a B-path from node i to node j, and a C-path from node j to node
k.

Consider  a  fixed  node  i.  To  what  nodes  should  node  i  have  an  A-edge  (i.e.,  to  what  nodes  is  there  an  A- 
path)?  The  production  A  ::=  B  C  indicates  that  we  should  add  an  A-edge  from  node  i  to  any  nodes  reached 
by  following  B  edges  from  node  i  and  then  following  C  edges.  In  our  representation  of  the  graph,  edges 
are  represented  as  constructors,  and  “following  an  edge”  can  be  encoded  using  projection:  in  particular,  the 
production  A  ::=  B  C  can  be  encoded  for  node  i  by  the  following  compound  set  constraint: 

Xi ⊇ A(C_1^{-1}(B_1^{-1}(Xi)))

Note that this constraint does not belong to the class of set constraints introduced in Section 2.2; however,
by introducing some additional set variables and constraints, it can be rewritten into the proper form. We
introduce two set variables:

Dst[A,i], which represents the "destinations" of A-edges from node i, and
Rchd[B^{-1},i], which represents the nodes reached by following B-edges from node i.

We also generate the following constraints to encode A ::= B C:

Rchd[B^{-1},i] ⊇ B_1^{-1}(Xi)              (Follow B edges from node i)

Dst[A,i] ⊇ C_1^{-1}(Rchd[B^{-1},i])        (Follow C edges from those nodes)

Xi ⊇ A(Dst[A,i])                           (Add A edges to the reached nodes)

Figure 3 depicts the use of the set variables Rchd[B^{-1},i] and Dst[A,i] in this encoding.

These constraints encode the production A ::= B C, but only "locally" for node i. That is, the solution
to these constraints will give the A-paths starting at node i (assuming that the B-paths and C-paths are
also solved for). To find all A-paths in the graph, similar constraints are generated for all other nodes of the
graph.

We note that the set variables introduced to encode this production (i.e., Dst[A,i] and Rchd[B^{-1},i]) may
also be used in encoding other productions. For example, to encode A ::= B D, we need to generate only
one new constraint: Dst[A,i] ⊇ D_1^{-1}(Rchd[B^{-1},i]).

The above discussion shows how to encode a production of the form A ::= B C. In a normalized CFL
grammar, productions may also have the form A ::= B and A ::= ε. To encode a production of the form
A ::= B at node i, we generate the constraints Xi ⊇ A(Dst[A,i]) and Dst[A,i] ⊇ B_1^{-1}(Xi). To encode a
production of the form A ::= ε, we generate the constraint Xi ⊇ A(Xi).
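The per-node encoding of the three normal-form production shapes can be sketched as follows. The function name `encode_production`, the string names of the helper variables, and the tuple encodings `("con", X, c, args)` for X ⊇ c(args) and `("proj", X, c, i, Y)` for X ⊇ c_i^{-1}(Y) are assumptions made for this illustration.

```python
def encode_production(lhs, rhs, i):
    """Constraints that encode the production lhs ::= rhs locally at node i.

    rhs is a list of zero, one, or two grammar symbols (normal form).
    """
    X = f"X{i}"
    dst = f"Dst[{lhs},{i}]"
    if len(rhs) == 0:
        # A ::= ε  gives  Xi ⊇ A(Xi)
        return [("con", X, lhs, (X,))]
    if len(rhs) == 1:
        # A ::= B  gives  Xi ⊇ A(Dst[A,i]) and Dst[A,i] ⊇ B_1^{-1}(Xi)
        (B,) = rhs
        return [("con", X, lhs, (dst,)), ("proj", dst, B, 1, X)]
    # A ::= B C  gives the three constraints listed in the text
    B, C = rhs
    rchd = f"Rchd[{B}^-1,{i}]"
    return [
        ("proj", rchd, B, 1, X),      # follow B edges from node i
        ("proj", dst, C, 1, rchd),    # follow C edges from those nodes
        ("con", X, lhs, (dst,)),      # add A edges to the reached nodes
    ]

binary = encode_production("A", ["B", "C"], 1)
```

Calling the function for every production and every node yields O(dn) constraints; a fuller version would also deduplicate constraints shared between productions, such as the common Rchd constraint for A ::= B C and A ::= B D.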

This completes the construction of the set-constraint problem. As we show in the next section, the
solution to a constructed set-constraint problem C can be used to obtain the solution to the original
CFL-reachability problem V. In particular, let H be the regular term grammar that gives the solution to C. Then
there is an S-path from n to m in the solution to V if and only if Xn =>* S(nodem) under H. We give a formal
proof of this in the next section.

3.3  Correctness  of  the  Construction 

We  now  formally  prove  that  the  solution  to  a  constructed  set-constraint  problem  gives  a  solution  to  the 
original  CFL-reachability  problem.  More  precisely,  we  have  the  following  theorem: 

Theorem 3.1 Let C be the collection of set constraints constructed to represent the context-free reachability
problem V. Let G be the graph that results from running the CFL-reachability Algorithm on V. Let H be
the regular term grammar that results from solving C. Then there is an edge A(i,j) in G if and only if
Xi =>* A(nodej) under H.

To prove this theorem, we employ one lemma, in which C, V, H, and G are defined as in Theorem 3.1.
The lemma, which is proved in Appendix A, is as follows:

Lemma 3.2 Let C' be the collection of constraints that results from running the SC-Reduction Algorithm on
C (i.e., C' is the union of C and the collection of constraints generated by the SC-Reduction Algorithm). Then
there is an edge A(i,j) in G if and only if C' contains Xi ⊇ A(Xj) and/or Dst[A,i] ⊇ nodej.

Theorem 3.1 follows immediately from Lemma 3.2. Note that H contains no productions of the form
U => V. This means that Xi =>* A(nodej) under H if and only if H contains productions of the form
Xi => A(V) and V => nodej. H contains productions of this form if and only if C' contains Xi ⊇ A(Xj) and
Xj ⊇ nodej, or C' contains Xi ⊇ A(Dst[A,i]) and Dst[A,i] ⊇ nodej (where A is a nonterminal). Since C (and
hence C') must contain Xj ⊇ nodej and Xi ⊇ A(Dst[A,i]) (for each nonterminal A), it follows that H contains
the required productions if and only if C' contains Xi ⊇ A(Xj) or Dst[A,i] ⊇ nodej.

3.4  Performing  the  Construction  in  Log-Space 

It is also easily shown that the construction given in this section can be carried out by a log-space Turing
machine. A log-space Turing machine has a read-only input tape, a read-write work tape with O(log x) cells,
where x is the size of the input, and a write-only output tape.

We claim that there exists a log-space Turing machine V1 that does the following: given an arbitrary
context-free grammar CF on the input tape, V1 outputs an equivalent context-free grammar CF' that is in
normal form. Consider the following typical context-free production q:

N  ::=  a  b  C  d  E 

This  production  can  be  replaced  with  the  following  productions  (which  are  in  normal  form): 
N ::= a T1
T1 ::= b T2
T2  ::=  C  T3 
T3  ::=  d  T4 
T4  ::=  E 

A Turing machine V1 can be written that processes each production q as follows: it scans q left to right;
for each position i of the right-hand side of production q (except the first and last positions), a production
T_{i-1} ::= a_i T_i is output, where a_i is the symbol at position i, and T_i is a new nonterminal symbol. V1
requires space on the work tape for one counter cnt, which it uses to generate new nonterminal symbols.
Since the number of nonterminals introduced is proportional to the length x of the context-free grammar,
V1 needs at most O(log x) bits on the work tape for cnt. Let V1' be the log-space Turing machine that takes
a CFL-reachability problem as input, and outputs the same CFL-reachability problem but with a normalized
grammar.
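The left-to-right splitting of a long production can be sketched in Python as below. This is only an illustration of the rewriting scheme, not of the log-space machine itself: the function names `make_fresh` and `normalize_production` are hypothetical, and the counter plays the role of cnt.

```python
import itertools

def make_fresh():
    """Return a generator of new nonterminal names T1, T2, ... (the counter cnt)."""
    counter = itertools.count(1)
    return lambda: f"T{next(counter)}"

def normalize_production(lhs, rhs, fresh):
    """Split lhs ::= rhs into productions with at most two right-hand-side symbols."""
    if len(rhs) <= 2:
        return [(lhs, list(rhs))]
    out = []
    head = lhs
    for sym in rhs[:-1]:
        nxt = fresh()                 # new nonterminal for the rest of the right-hand side
        out.append((head, [sym, nxt]))
        head = nxt
    out.append((head, [rhs[-1]]))     # final production carries the last symbol alone
    return out

chain = normalize_production("N", ["a", "b", "C", "d", "E"], make_fresh())
```

Applied to N ::= a b C d E, the sketch reproduces the five productions listed above, from N ::= a T1 through T4 ::= E.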

We  also  claim  that  there  exists  a  log-space  Turing  machine  V2  that,  given  a  CFL-reachability  problem 
with  a  context-free  grammar  in  normal  form,  performs  the  construction  of  Sections  3.1  and  3.2.  V2  operates 
in  two  phases:  in  phase  I,  it  scans  each  edge  e  of  the  graph  G  of  the  CFL-reachability  problem  and  outputs  a 
corresponding  constraint;  in  phase  II,  it  encodes  each  production  of  the  context-free  grammar  for  each  node 
of the graph G. Phase I requires no space on the work tape. Phase II requires space on the work tape for
the following items:

1. an index idx1 into the input tape that points to the current production.

2. an index idx2 into the input tape that points to the current node.

3. a counter cnt for producing unique set variables for each constraint introduced during phase II.

The indices idx1 and idx2 can be represented with O(log x) bits, where x is the size of the input problem.
The counter cnt requires O(log(p·n)) bits, where p is the number of productions in the context-free grammar
CF, and n is the number of nodes in the graph G. Note that O(log(p·n)) ≤ O(log x^2) = O(2·log x).

For any two log-space Turing machines Q and R, there is a log-space Turing machine that is equivalent
to the composition Q ∘ R [28]. This means that there is a log-space Turing machine V that is equivalent
to V2 ∘ V1' and performs the construction of this section for an arbitrary context-free grammar. Since
CFL-reachability problems are PTIME-complete (i.e., complete for PTIME under log-space reductions) [1, 38, 48],
this means that the given class of set-constraint problems is also PTIME-complete [28].

3.5  Analysis  of  the  Running  Time 

In general, an all-pairs CFL-reachability problem can be solved in time O(n^3), where n is the number of
nodes in the graph. The class of set constraints considered can be solved in time O(t^3), where t is the number
of constraints. However, for a set-constraint problem constructed from a CFL-reachability problem, this
does not yield a satisfactory time bound, at least from the standpoint of showing that the two classes of
problems are interconvertible: encoding the graph potentially creates n constraints of the form Xi ⊇ nodei
and e constraints of the form Xi ⊇ a(Xj), where e is the number of edges in the graph. Encoding the
productions may create O(dn) constraints, where d is the number of productions. Because e can be as large
as n^2, this would give a bound of O(n^6) on the running time to solve the set-constraint problem.

However, as we now show, a sharper analysis yields a better bound on the running time for the constructed
set-constraint problem. In the discussion below, we use the values defined in the following table:

k    the number of atomic expressions in C
v    the number of variables in C
p    the maximum number of projection constraints that can match with a given constraint in explicit form
t    the total number of constraints in C
d    the number of productions in the context-free grammar of the original problem
s    the number of symbols in the context-free grammar of the original problem
n    the number of nodes in the graph of the original problem

Recall that in Section 2.2.3, we gave a tighter bound of O(pkv + kv^2 + t) for the running time of the
SC-Reduction Algorithm on a collection C of set constraints.

Let C be a constructed set-constraint problem. Then the atomic expressions in C have one of the forms
A(Dst[A,i]) and A(Xi). This means that k is bounded by O(sn). Each variable in C has one of the forms
Dst[A,i], Rchd[A^{-1},i], and Xi. Thus, v is bounded by O(sn). A given constraint of the form
Rchd[B^{-1},i] ⊇ C(V) matches with projection constraints in C of the form Dst[A,i] ⊇ C_1^{-1}(Rchd[B^{-1},i]).
A given constraint of the form Xi ⊇ B(Xj) matches with projection constraints in C with one of the forms
Dst[A,i] ⊇ B_1^{-1}(Xi) and Rchd[B^{-1},i] ⊇ B_1^{-1}(Xi). This means that p is bounded by O(s).

Thus, the total time needed to solve C is bounded by O(s·sn·sn + sn·(sn)^2 + dn + n^2). Since d is
bounded by s^3, it follows that the run time is bounded by O(s^3 n^3). Since s is a constant independent of the
input, this gives a bound on the running time of O(n^3).
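For readers who want the substitution spelled out, the following worked derivation plugs the bounds k = O(sn), v = O(sn), p = O(s), t = O(dn + n^2), and d ≤ s^3 from the text into O(pkv + kv^2 + t). It restates the argument above term by term and adds no new assumptions.

```latex
\begin{align*}
pkv  &= O(s \cdot sn \cdot sn) = O(s^{3} n^{2}), \\
kv^{2} &= O\bigl(sn \cdot (sn)^{2}\bigr) = O(s^{3} n^{3}), \\
t    &= O(dn + n^{2}) \subseteq O(s^{3} n + n^{2}), \\
pkv + kv^{2} + t &= O(s^{3} n^{2} + s^{3} n^{3} + s^{3} n + n^{2}) = O(s^{3} n^{3}).
\end{align*}
```

The kv^2 term dominates, which is why the final bound is cubic in n once s is treated as a constant.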


4  Solving  Set-Constraint  Problems  Using  CFL-Reachability 

4.1  Encoding  Set  Constraints  as  Graphs 

4.1.1  The  Idea  Behind  the  Construction 

We  now  turn  to  the  problem  of  encoding  set-constraint  problems  as  CFL-reachability  problems.  The  basic 
technique  is  a  modification  of  work  done  by  Reps  in  using  CFL-reachability  to  do  shape  analysis  [37].  In 
essence,  our  encoding  involves  simulating  the  steps  of  the  SC-Reduction  Algorithm  with  the  productions  of 
a  reachability  problem.  In  the  following  example,  we  show  how  the  SC-Reduction  Algorithm  computes  what 
atomic  expressions  reach  each  set  variable  and  consider  how  this  can  be  simulated  with  a  CFL-reachability 
problem: 

Example  4.1  Consider  the  following  constraints: 

V1 ⊇ a
V2 ⊇ V1
V3 ⊇ cons(V1,V2)
V4 ⊇ cons_2^{-1}(V3)

The SC-Reduction Algorithm reduces the constraints V1 ⊇ a and V2 ⊇ V1 by adding the constraint V2 ⊇ a,
which indicates that the atomic expression a reaches V2. This will be simulated in the CFL-reachability
problem by nodes for a, V1, and V2, together with edges Id(a,V1) and Id(V1,V2). The counterpart of the
reduction step is reachability in the graph: the path made of edges Id(a,V1) and Id(V1,V2), together with
the production "Id ::= Id Id", yields an edge Id(a,V2). Just as the SC-Reduction Algorithm outputs the
regular term grammar production V2 => a because of the constraint V2 ⊇ a, we output the regular term
grammar production V2 => a because of the edge Id(a,V2).

The SC-Reduction Algorithm also reduces the constraints V3 ⊇ cons(V1,V2) and V4 ⊇ cons_2^{-1}(V3) by
adding the constraint V4 ⊇ V2. In the CFL-reachability problem, this will (roughly) be simulated by the edges
cons2(V2, cons(V1,V2)), Id(cons(V1,V2),V3), and cons_2^{-1}(V3,V4), and the production
"Id ::= cons2 Id cons_2^{-1}". This yields the edge Id(V2,V4), which models the constraint V4 ⊇ V2. Figure 5
shows the graph that is constructed to represent the set constraints used in this example; the construction
of this graph is explained below.

□ 

With  this  intuition  in  mind,  we  make  our  first  attempt  to  construct  a  CFL-reachability  problem  that  will 
give  the  solution  to  a  set-constraint  problem.  (For  now,  we  ignore  the  clauses  about  ground  expressions  in  the 
SC-Reduction  Algorithm.  Section  4.1.2  covers  the  modifications  needed  to  account  for  ground  expressions.) 

The  CFL-reachability  framework  uses  a  graph  and  context-free  grammar  and  finds  paths  in  the  graph. 
We  want  to  use  this  framework  to  find  what  atomic  expressions  reach  each  set  variable;  we  construct  a 
graph  containing  a  node  for  each  atomic  expression  and  each  set  variable.  This  graph  will  contain  edges  that 
encode  the  set  constraints.  We  construct  a  context-free  grammar  such  that  the  CFL-reachability  Algorithm 
will  find  identity  paths  from  nodes  representing  atomic  expressions  to  nodes  representing  set  variables. 

The  solution  to  the  set-constraint  problem  (in  the  form  of  a  regular  term  grammar)  is  obtained  from  the 
reachability relations that hold in the graph. If node a represents an atomic expression, node V represents
a variable, and there is an identity path from a to V, then the production V => a is in the regular term
grammar.

More  precisely,  the  graph  for  Example  4.1  is  constructed  as  follows  (the  general  construction  is  given  in 
Section  4.2): 

• For each set variable Vi, the graph contains a node labelled Vi.

• Each atomic expression cons(Vi, Vj) used in the constraints is associated with a unique index. This is
for notational convenience; it is easier to refer to an expression by its index than by writing out the
expression.

For each atomic expression cons(Vi, Vj) with index k, the graph contains a node labelled (k) and the
edges cons1(Vi, (k)) and cons2(Vj, (k)). An edge cons_m(Vi, (k)) indicates that the values that reach Vi
are wrapped in the mth position of the cons value represented by node (k). (See Figure 4(c).)

• For each constraint of the form Vi ⊇ Vj, the graph contains an edge Id(Vj,Vi). An edge labelled Id
indicates an identity path in the graph. An identity path from node j to node i indicates that the
values that reach node j also reach node i. (See Figure 4(a).)

• For each constraint of the form Vh ⊇ cons(Vi, Vj), where the atomic expression cons(Vi, Vj) has index k,
the graph contains an edge Id((k),Vh). This indicates that the atomic expression cons(Vi,Vj) reaches
Vh. (See Figure 4(b).)

• For each constraint of the form Vi ⊇ cons_k^{-1}(Vj), the graph contains an edge cons_k^{-1}(Vj,Vi). An edge
cons_k^{-1}(Vj,Vi) indicates that values at node i are taken from the kth position of cons values at node
j. (See Figure 4(d).)

Productions  are  introduced  in  the  context-free  grammar  to  encode  the  simplification  steps  of  the  SC- 
Reduction  Algorithm.  The  first  reduction  step  of  the  SC-Reduction  Algorithm  is  encoded  via  productions 
that  indicate  the  fact  that  values  can  pass  through  cons  values  by  being  wrapped  in  a  cons  and  then 
unwrapped  by  a  projection: 
Id ::= cons1 Id cons_1^{-1}
Id ::= cons2 Id cons_2^{-1}

Figure 5: The graph built to encode the constraints in Example 4.1. The nodes (j) and (k) represent the
atomic expressions a and cons(V1,V2), respectively.

Figure 6: The graph for Example 4.1 after the CFL-reachability Algorithm has been run. Dashed lines
represent edges inserted by the algorithm. The nodes (j) and (k) represent the atomic expressions a and
cons(V1,V2), respectively.


In Example 4.1, the SC-Reduction Algorithm adds the constraint V4 ⊇ V2 because of the constraints
V3 ⊇ cons(V1,V2) and V4 ⊇ cons_2^{-1}(V3). Let cons(V1,V2) have index k. Then, in the constructed graph, the
CFL-reachability algorithm adds the edge Id(V2,V4) because of the edges cons2(V2,(k)), Id((k),V3), and
cons_2^{-1}(V3,V4) (see Figure 6).

To  encode  the  second  reduction  step  of  the  SC-Reduction  Algorithm,  the  following  production  is  put  in 
the  context-free  grammar: 

Id  ::=  Id  Id 

In Example 4.1, the SC-Reduction Algorithm adds the constraint V2 ⊇ a because of the constraints V2 ⊇ V1
and V1 ⊇ a. Given that the atomic expression a has index j, the CFL-reachability algorithm adds the edge
Id((j),V2) because of the edges Id((j),V1) and Id(V1,V2) (see Figure 6).

Figure  6  shows  the  graph  constructed  from  Example  4.1  after  the  CFL-reachability  Algorithm  is  run. 
The  regular  term  grammar  that  is  the  solution  to  the  set-constraint  problem  can  be  obtained  from  this  graph 
by examining Id edges from nodes representing atomic expressions. Thus, the edges Id((j),V1), Id((j),V2),
and Id((j),V4) indicate that the atomic expression a reaches set variables V1, V2, and V4; this indicates
that the regular term grammar that represents a solution to the set constraints should contain the following
productions:

V1 => a
V2 => a
V4 => a

The edge Id((k),V3) indicates that the following production should be in the regular term grammar as well:

V3 => cons(V1,V2)
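The first-attempt construction for Example 4.1 can be checked with a short Python sketch. The closure routine below is a deliberately naive fixpoint computation, not the CFL-reachability Algorithm referred to in the text, and it ignores groundness exactly as Section 4.1.1 does. The function name `cfl_closure`, the node names "j" and "k", and the label spellings such as "cons2^-1" are assumptions made for this illustration; productions are assumed to have non-empty right-hand sides.

```python
def cfl_closure(edges, productions):
    """Naive closure: add lhs(u, v) whenever some path u -> v spells the right-hand side.

    edges: set of (label, src, dst); productions: list of (lhs, [labels]).
    """
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            # pairs holds (u, v) such that the symbols consumed so far spell a path u -> v
            pairs = {(u, v) for (l, u, v) in edges if l == rhs[0]}
            for sym in rhs[1:]:
                step = {(u, v) for (l, u, v) in edges if l == sym}
                pairs = {(u, w) for (u, v) in pairs for (v2, w) in step if v == v2}
            for (u, v) in pairs:
                if (lhs, u, v) not in edges:
                    edges.add((lhs, u, v))
                    changed = True
    return edges

# Graph for Example 4.1: node "j" stands for a, node "k" for cons(V1, V2).
graph = {
    ("Id", "j", "V1"),          # V1 ⊇ a
    ("Id", "V1", "V2"),         # V2 ⊇ V1
    ("cons1", "V1", "k"),       # V1 is wrapped in position 1 of (k)
    ("cons2", "V2", "k"),       # V2 is wrapped in position 2 of (k)
    ("Id", "k", "V3"),          # V3 ⊇ cons(V1, V2)
    ("cons2^-1", "V3", "V4"),   # V4 ⊇ cons_2^{-1}(V3)
}
grammar = [
    ("Id", ["Id", "Id"]),
    ("Id", ["cons1", "Id", "cons1^-1"]),
    ("Id", ["cons2", "Id", "cons2^-1"]),
]
closed = cfl_closure(graph, grammar)
reached_by_a = {v for (l, u, v) in closed if l == "Id" and u == "j"}
reached_by_cons = {v for (l, u, v) in closed if l == "Id" and u == "k"}
```

Reading off the Id edges that leave "j" and "k" gives the four productions listed above.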
4.1.2  Accounting  for  Ground  Expressions

For  any  given  set-constraint  problem,  the  construction  of  Section  4.1.1  does  yield  a  regular  term  grammar 
that  describes  a  solution  to  the  problem.  However,  this  regular  term  grammar  does  not  necessarily  describe 
the  least  solution. 

The problem is that a production of the form "Id ::= cons_i Id cons_i^{-1}" allows identity paths through
cons expressions that are not ground, and the production "Id ::= Id Id" propagates non-ground atomic
expressions. This is at odds with the simplification steps of the SC-Reduction Algorithm. We consider the
problem with productions of the form "Id ::= cons_i Id cons_i^{-1}" first.

Example 4.2 Let C be a collection of constraints. Suppose that C is a superset of the following constraints:

V1 ⊇ a
V3 ⊇ cons(V1,V2)
V4 ⊇ cons_1^{-1}(V3)


In the least solution to C, V3 may or may not be ground. If V2 is ground, then cons(V1,V2) is ground (since
V1 must be ground because of the constraint V1 ⊇ a), and the SC-Reduction Algorithm would perform the
following steps:

• Add the constraint V4 ⊇ V1 (because of the constraints V3 ⊇ cons(V1,V2) and V4 ⊇ cons_1^{-1}(V3)).

• Add the constraint V4 ⊇ a (because of the new constraint V4 ⊇ V1 and the constraint V1 ⊇ a).

• Output the production V4 => a (because of the new constraint V4 ⊇ a).

If V2 ultimately is not ground, then the expression cons(V1,V2) is not ground, and the SC-Reduction
Algorithm does not perform the first two of these steps and might not generate the production V4 => a. (The
SC-Reduction Algorithm may still generate V4 => a as a result of reducing other constraints in C; but it
would not generate V4 => a as a result of reducing the particular constraints discussed above.)

Figure 7 shows a fragment of the graph created to represent these constraints by the construction from
Section 4.1.1. (In Figure 7 and the following discussion, we assume that the atomic expression a has index
j, and the atomic expression cons(V1,V2) has index k.) The CFL-reachability algorithm will add the edge
Id(V1,V4) to this graph regardless of whether or not the expression cons(V1,V2) is ground. This is because of
the production Id ::= cons1 Id cons_1^{-1} and the edges cons1(V1,(k)), Id((k),V3), and cons_1^{-1}(V3,V4). Adding
edge Id(V1,V4) when the expression cons(V1,V2) is not ground may lead to a non-minimal solution. In the
remainder of the section, we give a modified construction for transforming a set-constraint problem to a
CFL-reachability problem. With the modified construction, the edge Id(V1,V4) would be added if and only
if the expression cons(V1,V2) is ground.

Remark: Example 4.2 illustrates why it is natural to use CFL-reachability for the analysis of lazy
languages: for these languages it is proper to infer that V4 receives the value a. Because Section 3 gives
a construction for converting CFL-reachability problems to set-constraint problems, this shows that
set constraints with strict semantics can be used for the analysis of lazy languages. The latter is not surprising;
it is easy to get strict constraints to behave as if they have lazy semantics by artificially grounding each set
variable V by adding the constraint V ⊇ dummy, where dummy is an otherwise unused nullary constructor.
For alternative treatments of lazy languages using set constraints see [26, 27].

Example  4.2  suggests  that  CFL-reachability  might  not  be  powerful  enough  to  express  analysis  problems 
for  strict  languages.  The  construction  given  in  the  remainder  of  this  section  shows  that  this  is  not  the  case. 
□ 

We now give a modified construction in which the production Id ::= cons_i Id cons_i^{-1} is replaced with
productions that capture the groundness conditions. To do this we need a technique for tracking additional
Boolean information about set variables. (For example, we need to keep track of whether or not a set variable
is ground.) In the constructed CFL-reachability problem, set variables are represented by nodes, and we
will use cyclic edges to mark Boolean information: the value of a Boolean property of a variable will be
indicated by the presence or absence of a cyclic edge at a node. Some of these cyclic edges are generated
during the construction of the graph; others are induced by the CFL-reachability Algorithm. The graph and
context-free grammar must be constructed properly for this to happen.

Figure 7: The edge Id(V1,V4) should be induced if and only if cons(V1,V2) is ground. If the edge Id(V1,V4)
is added when cons(V1,V2) is not ground, it may incorrectly cause the edge Id(a,V4) to be added, and the
production V4 => a to be output. The nodes (j) and (k) represent the atomic expressions a and cons(V1,V2),
respectively.

In particular, we introduce a new kind of edge label, Ground, which will be used to indicate that a
variable or atomic expression is ground: an edge Ground(Vi,Vi) indicates that the variable Vi is known to
be ground, while an edge Ground((j),(j)) indicates that the atomic expression with index j is known to be
ground. In Figure 7, the edges Ground(V1,V1) and Ground(V2,V2) will be added to the graph if and only
if V1 and V2 are ground, respectively. The edge Ground((k),(k)) will be added to the graph if and only if
cons(V1,V2) is known to be ground (i.e., if both V1 and V2 are ground).

We now illustrate the use of the Ground edges by means of Example 4.2. In Example 4.2, we want
the graph to contain the cyclic edge Ground((k),(k)) if and only if cons(V1,V2) is ground. In place of the
production Id ::= cons_i Id cons_i^{-1}, we use the following production:

Id ::= cons_i Ground Id cons_i^{-1}

With this production, the CFL-reachability Algorithm will add the edge Id(V1,V4) if and only if the edge
Ground((k),(k)) is present (i.e., if and only if cons(V1,V2) is ground); see Figure 8(a).

We now show how to modify the graph and the productions to deal with Ground edges. Some Ground
edges are generated when constructing the graph. In particular, for every atomic expression of the form a
with index j, we generate the edge Ground((j),(j)), because a nullary constructor is always ground.

Other Ground edges are induced during the running of the CFL-reachability Algorithm. In Example 4.2,
the atomic expression cons(V1,V2) is ground if and only if V1 and V2 are both ground. We modify the
construction so that the following edges are also introduced in the original graph:

edge(k)toV1((k), V1)
edgeV1to(k)(V1, (k))
edge(k)toV2((k), V2)
edgeV2to(k)(V2, (k))

These edges simply connect nodes V1, V2, and (k), and allow us to introduce the following production:

Ground ::= edge(k)toV1 Ground edgeV1to(k) edge(k)toV2 Ground edgeV2to(k)

With this production and the edges used in it, the CFL-reachability Algorithm will induce the edge
Ground((k),(k)) iff the edges Ground(V1,V1) and Ground(V2,V2) exist. See Figure 8(b).

There is one last situation we must take into account: Suppose that in Example 4.2 the atomic expression
cons(V1, V2) (with index k) is known to be ground, and consider the constraint V3 ⊇ cons(V1, V2); this
implies that the variable V3 is also ground. In the graph constructed for this situation, we have the edges
Ground((k), (k)) and Id((k), V3), and we want the edge Ground(V3, V3) to be added. In effect, we want the
Ground information at (k) to be propagated along the Id edge. To accomplish this, we introduce the edges
Rev_Id(V3, (k)) and edgeV3toV3(V3, V3), and the following production:

Ground ::= edgeV3toV3 Rev_Id Ground Id edgeV3toV3

With this production, the CFL-reachability Algorithm will add the edge Ground(V3, V3) to the graph (see
Figure 9).
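The three groundness rules described so far (a nullary constructor is ground; a constructor expression is ground iff all of its arguments are ground; a variable becomes ground when a ground atomic expression reaches it) amount to a least-fixpoint computation. The following Python fragment is a minimal sketch of that fixpoint, written independently of the graph encoding. The function name and the data layout (`atoms`, `reaches`) are assumptions made for illustration only; they do not appear in the construction itself.

```python
def ground_sets(atoms, reaches):
    """Compute which atomic expressions and variables are ground.

    atoms:   dict mapping an index k to the tuple of argument variables of
             the atomic expression with that index (empty tuple = nullary).
    reaches: set of (k, variable) pairs; each pair plays the role of an
             Id edge from node (k) to the variable.
    Returns (ground_atoms, ground_vars).
    """
    ground_atoms, ground_vars = set(), set()
    changed = True
    while changed:
        changed = False
        # An atomic expression is ground once all its arguments are ground;
        # a nullary constructor satisfies this trivially.
        for k, args in atoms.items():
            if k not in ground_atoms and all(a in ground_vars for a in args):
                ground_atoms.add(k)
                changed = True
        # Groundness at node (k) propagates along the Id edge to the variable.
        for k, v in reaches:
            if k in ground_atoms and v not in ground_vars:
                ground_vars.add(v)
                changed = True
    return ground_atoms, ground_vars


# Hypothetical instance: atom 1 is a nullary constructor, atom 2 is cons(V1, V2).
atoms = {1: (), 2: ("V1", "V2")}
full = ground_sets(atoms, {(1, "V1"), (1, "V2"), (2, "V3")})
partial = ground_sets(atoms, {(1, "V1"), (2, "V3")})
```

In the first call both arguments of cons(V1, V2) are ground, so V3 becomes ground as well; in the second call V2 is never ground, so neither cons(V1, V2) nor V3 is marked ground.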

There is one more issue that is not well illustrated in Example 4.2. In order to propagate ground
information along an Id edge, we need a corresponding Rev_Id edge. That is, for any edge Id(Vi, Vj) in the
graph, we need an edge Rev_Id(Vj, Vi) in the reverse direction. We now show how these Rev_Id edges are
created. Recall that some Id edges are induced by the CFL-reachability Algorithm. If the CFL-reachability
Algorithm induces an edge Id(Vi, Vj), then we want it to induce an edge Rev_Id(Vj, Vi). To have this happen
without changing the CFL-reachability Algorithm, we need to add more productions to the grammar. For
example, the following production indicates that the CFL-reachability Algorithm should induce an Id edge
(assuming an appropriate path exists in the graph):

Id ::= cons_1 Ground Id cons_1^{-1}

Consequently, we need an equivalent “reverse” production to indicate that the corresponding Rev_Id edge
should be induced:

Rev_Id ::= rev_cons_1^{-1} Rev_Id Ground rev_cons_1

Figure 10 illustrates the use of this reverse production.

For this production to work, we need additional reverse edges: For every edge cons_1(Vi, Vj) in the graph,
we want the edge rev_cons_1(Vj, Vi) to be in the graph; for every edge cons_1^{-1}(Vi, Vj), we want the edge
rev_cons_1^{-1}(Vj, Vi) to be in the graph. Fortunately, these reverse edges can be added when we construct the
graph. They do not require the introduction of new productions. Notice also that an edge labelled Ground is
always cyclic. Hence, it can serve as its own reverse edge and so we do not need an edge labelled Rev_Ground.

Now that we have addressed the problems with productions of the form Id ::= cons_1 Id cons_1^{-1}, we are
ready to address the production Id ::= Id Id. In fact there are two problems with this production:

1. Consider the constraints X ⊇ Y and Y ⊇ cons(Z, W), represented by the edges Id(Y, X) and Id((k), Y).
The production Id ::= Id Id causes the edge Id((k), X) to be introduced, regardless of whether or not
cons(Z, W) is ground.

2. Consider the constraints X ⊇ Y and Y ⊇ Z, represented by the edges Id(Y, X) and Id(Z, Y). The
production Id ::= Id Id causes the edge Id(Z, X) to be introduced.

In  both  of  these  cases,  the  simplification  steps  of  the  SC-Reduction  Algorithm  are  not  accurately  represented. 

To fix this, for each node (k) representing an atomic expression, we indicate that (k) represents an atomic
expression by introducing the edge ae((k), (k)). We replace the production Id ::= Id Id with the following
production:

Id  ::=  Ground  ae  Id  Id 

This  production  accurately  encodes  the  second  reduction  step  of  the  SC-Reduction  Algorithm. 

Figure 10: The production Rev_Id ::= rev_cons_1^{-1} Rev_Id Ground rev_cons_1
causes the CFL-reachability Algorithm to produce Rev_Id edges. (This production
is the counterpart of the production Id ::= cons_1 Ground Id cons_1^{-1}.)

4.2 Summary of the Construction

Above, we presented the concepts of the construction in terms of a specific example. In this section, we
present it for an arbitrary set-constraint problem. In general, the CFL-reachability problem (which consists
of a graph and a context-free grammar) is constructed as follows:

1. The context-free grammar contains the productions

Id ::= Ground ae Id Id

Rev_Id ::= Rev_Id Rev_Id Ground ae

2. For each set variable Vi, the graph contains a node named Vi, and a uniquely labelled edge edgeVitoVi(Vi, Vi).
The context-free grammar contains the production

Ground ::= edgeVitoVi Rev_Id Ground Id edgeVitoVi


3. For each atomic expression c(V1, V2, ..., Vr) with index k used in the set constraints, the graph contains
a node labelled (k) and an edge ae((k), (k)). If c is a nullary constructor (i.e., r = 0), then the graph
contains the edge Ground((k), (k)). Otherwise, for each position j of this atomic expression, the graph
contains the edges

c_j(Vj, (k))
rev_c_j((k), Vj)
edge(k)toVj((k), Vj)
edgeVjto(k)(Vj, (k))

and the context-free grammar contains the productions

Id ::= c_j Ground Id c_j^{-1}

Rev_Id ::= rev_c_j^{-1} Rev_Id Ground rev_c_j

The context-free grammar also contains the production

Ground ::= edge(k)toV1 Ground edgeV1to(k) edge(k)toV2 Ground edgeV2to(k)
           ... edge(k)toVr Ground edgeVrto(k)

4. For each constraint of the form Vi ⊇ Vj, the graph contains edges Id(Vj, Vi) and Rev_Id(Vi, Vj).

5. For each constraint of the form V ⊇ c(V1, V2, ..., Vr), where c(V1, V2, ..., Vr) has index k, the graph
contains edges Id((k), V) and Rev_Id(V, (k)).

6. For each constraint of the form Vi ⊇ c_m^{-1}(Vj), the graph contains edges c_m^{-1}(Vj, Vi) and rev_c_m^{-1}(Vi, Vj).

After the CFL-reachability Algorithm is run on a constructed problem, a tree grammar representing the
solution to the original set-constraint problem is generated as follows: For each edge Id((k), Vi), where k is
the index of the atomic expression c(V1, V2, ..., Vr), output the regular tree production Vi => c(V1, V2, ..., Vr).
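The graph-building steps and the final grammar extraction summarized in this section can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the tuple encoding of constraints, the string spelling of edge labels (for example `cons_1^-1`), and the function names `build_graph` and `tree_grammar` are all hypothetical, and the grammar productions themselves are not generated here.

```python
def build_graph(variables, atoms, constraints):
    """Return the initial edge set as (label, source, target) triples.

    variables:   iterable of set-variable names.
    atoms:       dict k -> (constructor, argument tuple), k being the index.
    constraints: list of tuples (assumed encoding):
                   ("var",  Vi, Vj)        for Vi ⊇ Vj
                   ("atom", V, k)          for V ⊇ atomic expression k
                   ("proj", Vi, c, m, Vj)  for Vi ⊇ c_m^{-1}(Vj)
    """
    edges = set()
    for v in variables:                          # one self-loop per variable
        edges.add((f"edge{v}to{v}", v, v))
    for k, (c, args) in atoms.items():           # one node per atomic expression
        node = f"({k})"
        edges.add(("ae", node, node))
        if not args:                             # nullary constructor is ground
            edges.add(("Ground", node, node))
        for j, vj in enumerate(args, start=1):
            edges.add((f"{c}_{j}", vj, node))
            edges.add((f"rev_{c}_{j}", node, vj))
            edges.add((f"edge{node}to{vj}", node, vj))
            edges.add((f"edge{vj}to{node}", vj, node))
    for con in constraints:
        if con[0] == "var":
            _, vi, vj = con
            edges.add(("Id", vj, vi))
            edges.add(("Rev_Id", vi, vj))
        elif con[0] == "atom":
            _, v, k = con
            edges.add(("Id", f"({k})", v))
            edges.add(("Rev_Id", v, f"({k})"))
        elif con[0] == "proj":
            _, vi, c, m, vj = con
            edges.add((f"{c}_{m}^-1", vj, vi))
            edges.add((f"rev_{c}_{m}^-1", vi, vj))
    return edges


def tree_grammar(solved_edges, atoms):
    """One production V => c(args) per edge Id((k), V) in the solved graph.

    Assumes variable names never start with "(" so that atom nodes can be
    recognized by their label.
    """
    prods = set()
    for label, src, dst in solved_edges:
        if label == "Id" and src.startswith("("):
            c, args = atoms[int(src[1:-1])]
            prods.add((dst, c, args))
    return prods


# Hypothetical instance: V3 ⊇ cons(V1, V2) and V4 ⊇ cons_1^{-1}(V3).
atoms = {1: ("cons", ("V1", "V2"))}
g = build_graph(["V1", "V2", "V3", "V4"], atoms,
                [("atom", "V3", 1), ("proj", "V4", "cons", 1, "V3")])
```

Running `tree_grammar` on the unsolved graph `g` already yields the production for V3, because the edge Id((1), V3) is present from the start; further productions appear only after the CFL-reachability Algorithm has added induced Id edges.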

4.3  Correctness  of  the  Construction 

We claim that the solution to the CFL-reachability problem gives the solution to the original set-constraint
problem. Specifically, we have the following theorem:

Theorem 4.3 Let C be a collection of set constraints, and let V be the CFL-reachability problem constructed
to represent C. Let H be the regular-tree grammar produced by the SC-Reduction Algorithm when run on C.
Let J be the regular-tree grammar produced from the solution to V (i.e., the grammar produced by outputting
a production for each edge of the form Id((k), V) in the solution to V). Then, H = J.

To  prove  this  theorem,  we  enlist  the  help  of  several  lemmas,  which  are  proved  in  Appendix  B.  In  the  following 
lemmas,  C  and  V  are  defined  as  in  Theorem  4.3.  We  also  have  the  following  definitions: 

C'  is  the  collection  of  set  constraints  that  results  from  running  the  SC-Reduction  Algorithm  on  C  (i.e.,  C' 
is  C  unioned  with  the  constraints  generated  by  the  SC-Reduction  Algorithm). 

G  is  the  original  graph  of  the  CFL-reachability  problem  V. 

G'  is  the  graph  that  results  from  running  the  CFL-reachability  algorithm  on  V  (i.e.,  G'  is  G  augmented 
with  the  edges  added  by  the  CFL-reachability  Algorithm). 

Lemma 4.4 If C' contains the constraint V ⊇ c(V1, V2, ..., Vr), then G' contains the edge Id((k), V), where
k is the index of c(V1, V2, ..., Vr).

Lemma 4.5 If G' contains the edge Id((k), V), then C' contains the constraint V ⊇ c(V1, V2, ..., Vr), where
c(V1, V2, ..., Vr) is the atomic expression with index k.

By Lemma 4.4 and Lemma 4.5, we have that C' contains V ⊇ c(V1, V2, ..., Vr) iff G' contains Id((k), V).
Theorem 4.3 follows immediately.

4.4  Cost  of  Solving  the  Constructed  CFL-Reachability  Problem 

A CFL-reachability problem can be solved in time O(|Σ|^3 n^3), where n is the number of nodes in the graph
and Σ is the alphabet of the grammar. Ordinarily, |Σ| is considered to be a constant and is ignored; however,
in a constructed CFL-reachability problem, |Σ| is O(t), where t is the number of constraints and the constant
of proportionality depends on the maximum arity of the constructors. Since n is also O(t), this gives us a
bound on the running time to solve the CFL-reachability problem of O(t^6), which is worse than the
bound of O(t^3) of the SC-Reduction Algorithm.
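To spell out the arithmetic behind the naive bound, suppose (as an illustrative assumption) that there are constants a and b with |Σ| ≤ a·t and n ≤ b·t. Substituting into the general bound gives:

```latex
O\bigl(|\Sigma|^{3}\, n^{3}\bigr)
  \;=\; O\bigl((a t)^{3} (b t)^{3}\bigr)
  \;=\; O\bigl(a^{3} b^{3}\, t^{6}\bigr)
  \;=\; O(t^{6})
```

The constants a and b depend on the maximum constructor arity but not on t, which is why they disappear in the final step.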

However,  a  closer  examination  of  the  CFL-reachability  Algorithm  shows  that  the  worst-case  time  bound 
is  not  realized  on  constructed  CFL-reachability  problems.  We  will  focus  our  analysis  on  step  4  of  the  CFL- 
reachability  Algorithm  (Algorithm  2.1).  In  this  step,  the  algorithm  processes  each  edge  that  appears  in  the 
(final)  graph.  For  each  edge,  it  examines  the  productions  in  which  that  edge’s  label  appears  on  the  right-hand 
side,  and  attempts  to  add  edges  to  the  graph  when  it  can  complete  the  right-hand  side  of  a  production  by 
matching  the  edge  with  neighboring  edges  in  the  graph.  Recall  that  the  CFL-reachability  Algorithm  will 
not  add  an  edge  to  the  graph  if  the  edge  already  exists. 
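The edge-processing loop just described can be sketched as a generic worklist procedure. The Python below is a minimal sketch for a grammar whose productions have at most two right-hand-side symbols; it is not a transcription of Algorithm 2.1, and the function name `cfl_reach` and its parameter layout are assumptions made for illustration. For brevity it scans every production for each edge, whereas an implementation aiming at the stated bounds would index productions by the labels on their right-hand sides.

```python
from collections import defaultdict, deque


def cfl_reach(nodes, edges, unary=(), binary=(), epsilon=()):
    """Close a labelled graph under the productions of a normalized grammar.

    edges:   iterable of (label, source, target) triples.
    unary:   (A, B) pairs for productions A ::= B.
    binary:  (A, B, C) triples for productions A ::= B C.
    epsilon: nonterminals A with a production A ::= (empty string).
    Returns the set of all edges, original and induced.
    """
    graph = set()
    work = deque()
    out_by = defaultdict(set)   # (node, label) -> successors via that label
    in_by = defaultdict(set)    # (node, label) -> predecessors via that label

    def add(label, u, v):
        # An edge that already exists is never added (or processed) again.
        if (label, u, v) not in graph:
            graph.add((label, u, v))
            out_by[(u, label)].add(v)
            in_by[(v, label)].add(u)
            work.append((label, u, v))

    for label, u, v in edges:
        add(label, u, v)
    for a in epsilon:
        for n in nodes:
            add(a, n, n)

    while work:
        b, u, v = work.popleft()
        for a, rhs in unary:
            if rhs == b:
                add(a, u, v)
        for a, left, right in binary:
            if left == b:       # b(u,v) followed by right(v,w) gives a(u,w)
                for w in list(out_by[(v, right)]):
                    add(a, u, w)
            if right == b:      # left(w,u) followed by b(u,v) gives a(w,v)
                for w in list(in_by[(u, left)]):
                    add(a, w, v)
    return graph


# Ordinary transitive closure as a CFL-reachability problem:
# Id ::= e and Id ::= Id Id over a three-edge chain.
closure = cfl_reach([1, 2, 3, 4],
                    [("e", 1, 2), ("e", 2, 3), ("e", 3, 4)],
                    unary=[("Id", "e")],
                    binary=[("Id", "Id", "Id")])
```

Each edge enters the worklist exactly once, so the total work is the number of edges in the final graph multiplied by the per-edge cost of examining productions and neighboring edges, which is the quantity the cost accounting bounds.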

The cost accounting argument presented in this section goes as follows: We show that for each type
of label used in the graph, the number of edges with a label of that type is bounded by O(t^2) (this gives
an upper bound on the number of edges that the CFL-reachability Algorithm must examine). Also, for
any given edge B(i, j) in a constructed graph, the amount of work performed can be broken down into two
categories:

1. The number of productions examined by the Algorithm: for a given edge B(i, j), this is the number
of productions in which B appears on the right-hand side of the production. In a constructed
CFL-reachability problem, this is bounded by O(t).

2. The number of edges that the CFL-reachability Algorithm attempts to add to the graph: in a constructed
CFL-reachability problem, this is bounded by O(t) over all of the productions examined when
processing a given edge B(i, j).

Thus, the total amount of work performed by the CFL-reachability Algorithm on a constructed problem is
O(t^2) * (O(t) + O(t)) = O(t^3).

We start by showing how a constructed grammar can be normalized in Section 4.4.1. In Section 4.4.2, we
present Table 3, which summarizes all of the different types of edge labels that may be used in a constructed
CFL-reachability problem, including those introduced by the normalization of the grammar. For every given
type of edge label, Table 3 also shows a bound on the number of edges with a label of that type, and a
bound on the number of steps the CFL-reachability Algorithm performs on any given edge with a label of
that type.

Figure 11: Normalization of the production
Ground ::= edgeVjtoVj Rev_Id Ground Id edgeVjtoVj.

Throughout the rest of the section, we use v to refer to the number of variables in the set-constraint
problem, t to refer to the number of constraints, n to refer to the number of nodes in the graph (n = O(v + t)),
and r to refer to the maximum arity of a constructor.

4.4.1  Normalization  of  a  Constructed  Grammar 

We start by converting the productions of the grammar to normal form. Consider the following prototypical
production:

Ground ::= edgeVjtoVj Rev_Id Ground Id edgeVjtoVj

There are v productions of this form, one for each node Vj. To normalize the production, we introduce
several new non-terminals and productions to replace the original production:

Ground ::= edgeVjtoVj G-edgeVjtoVj
G-edgeVjtoVj ::= G edgeVjtoVj
G ::= Rev_Id Ground-Id
Ground-Id ::= Ground Id

Figure 11 depicts this normalization. Note that edges labelled Id and Rev_Id may be ubiquitous; they
may occur anywhere in the graph. This means that the CFL-reachability Algorithm may use the above
productions and put edges labelled Ground-Id and G anywhere in the graph. However, for any given Vj,
there is only one edge labelled edgeVjtoVj in the graph; this is the edge edgeVjtoVj(Vj, Vj). This means
that for a fixed Vj, if the CFL-reachability Algorithm adds an edge G-edgeVjtoVj(Vi, Vk), then it must
use edgeVjtoVj(Vj, Vj) to do so, and k = j. That is, all edges labelled G-edgeVjtoVj must have node Vj
as their destination, although they may have any node as their source. This in turn implies that for a
fixed node Vj, the number of incoming edges of the form G-edgeVjtoVj(Vi, Vj) is bounded by O(n), and
the number of outgoing edges of the form G-edgeVktoVk(Vj, Vk) is bounded by O(n). Also, of all the edges
G-edgeVjtoVj(Vi, Vj), only one, G-edgeVjtoVj(Vj, Vj), can be combined with edgeVjtoVj(Vj, Vj) to generate
Ground(Vj, Vj).

Now we consider the following prototypical production:

Id ::= c_i Ground Id c_i^{-1}

There are O(tr) productions of this form, one for each position of each different constructor type used in the
constraints. It is normalized to the following productions:

Id ::= c_i Ground-Id-c_i^{-1}
Ground-Id-c_i^{-1} ::= Ground-Id c_i^{-1}
Ground-Id ::= Ground Id

The corresponding “reverse” production

Rev_Id ::= rev_c_i^{-1} Rev_Id Ground rev_c_i

is normalized in a similar fashion:

Rev_Id ::= rev_c_i^{-1}-Rev_Id-Ground rev_c_i
rev_c_i^{-1}-Rev_Id-Ground ::= rev_c_i^{-1} Rev_Id-Ground
Rev_Id-Ground ::= Rev_Id Ground

Recall that Ground edges are always cyclic. This means that there are at most O(n) edges with the label
Ground, and at most O(n^2) edges with the labels Ground-Id or Rev_Id-Ground. The number of edges
with labels of the form c_i, c_i^{-1}, rev_c_i, or rev_c_i^{-1} is bounded by O(tr) (these edges are introduced only
when constructing the original graph). This means that the number of edges with a label of the form
Ground-Id-c_i^{-1} or rev_c_i^{-1}-Rev_Id-Ground is bounded by O(trn).

The production

Id ::= Ground ae Id Id

is normalized to

Ground-ae ::= Ground ae
Ground-ae-Id ::= Ground-ae Id
Id ::= Ground-ae-Id Id

The corresponding “reverse” production

Rev_Id ::= Rev_Id Rev_Id Ground ae

is normalized to

Ground-ae ::= Ground ae
Rev_Id-Ground-ae ::= Rev_Id Ground-ae
Rev_Id ::= Rev_Id Rev_Id-Ground-ae

Since Ground and ae edges are always cyclic, it follows that Ground-ae edges are always cyclic. This means
the number of edges with the label Ground-ae is bounded by O(t), which implies that the number of edges
with the labels Ground-ae-Id and Rev_Id-Ground-ae is bounded by O(tv).
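The normalization of Id ::= Ground ae Id Id groups the right-hand side from left to right and names each intermediate non-terminal by hyphenating the symbols it covers. A minimal Python sketch of that left-to-right scheme is given below; the function name `binarize` and the tuple representation of productions are assumptions for illustration. Note that this sketch covers only the left-to-right grouping: several productions in this section are deliberately grouped differently (for example, so that a uniquely labelled edge limits where intermediate edges can appear), and the grouping affects the edge counts used in the cost analysis.

```python
def binarize(lhs, rhs):
    """Split lhs ::= rhs[0] ... rhs[n-1] into productions with at most two
    right-hand-side symbols, grouping from the left.

    Returns a list of (nonterminal, (symbol, ...)) pairs.
    """
    rhs = list(rhs)
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    prods = []
    prev = rhs[0]
    for sym in rhs[1:-1]:
        fresh = f"{prev}-{sym}"          # hyphenated name for the prefix so far
        prods.append((fresh, (prev, sym)))
        prev = fresh
    prods.append((lhs, (prev, rhs[-1])))
    return prods


normalized = binarize("Id", ["Ground", "ae", "Id", "Id"])
```

For this input the sketch yields Ground-ae ::= Ground ae, then Ground-ae-Id ::= Ground-ae Id, and finally Id ::= Ground-ae-Id Id.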

We must also normalize productions having the following form:

Ground ::= edge(k)toV1 Ground edgeV1to(k) edge(k)toV2 Ground edgeV2to(k)
           ... edge(k)toVr Ground edgeVrto(k)

There are O(t) productions of this form, one for each atomic expression used in each constraint. This
production is replaced by the following productions (which are not in normal form):

Ground ::= MarkV1GrAt(k) MarkV2GrAt(k) ... MarkVrGrAt(k)
MarkV1GrAt(k) ::= edge(k)toV1 Ground edgeV1to(k)
MarkV2GrAt(k) ::= edge(k)toV2 Ground edgeV2to(k)
...
MarkVrGrAt(k) ::= edge(k)toVr Ground edgeVrto(k)

An edge label of the form MarkViGrAt(k) can only appear on a cyclic edge MarkViGrAt(k)((k), (k)) at node
(k). Such an edge has the effect of “marking Vi ground at node (k).” Productions of the form

MarkViGrAt(k) ::= edge(k)toVi Ground edgeVito(k)

are normalized to the following productions:

MarkViGrAt(k) ::= edge(k)toVi Ground-edgeVito(k)
Ground-edgeVito(k) ::= Ground edgeVito(k)

Finally, productions of the following form must also be normalized:

Ground ::= MarkV1GrAt(k) MarkV2GrAt(k) ... MarkVrGrAt(k)

There are O(t) productions of this form. Each is normalized to the following productions:

Ground ::= MarkV1-VrGrAt(k)
MarkV1-V2GrAt(k) ::= MarkV1GrAt(k) MarkV2GrAt(k)
MarkV1-V3GrAt(k) ::= MarkV1-V2GrAt(k) MarkV3GrAt(k)
...
MarkV1-VrGrAt(k) ::= MarkV1-V(r-1)GrAt(k) MarkVrGrAt(k)

With these normalized productions, the CFL-reachability Algorithm will add at most O(tr) edges with labels
of the form MarkV1-VjGrAt(k) (O(r) for each of O(t) productions). All of these edges will be cyclic.


4.4.2  Counting  Steps 

Table  3  lists  the  various  forms  of  labels  that  may  appear  in  a  constructed  graph.  For  each  form  of  label,  it 
gives  a  bound  on  the  number  of  edges  with  a  label  of  that  form  (column  2),  and  shows  the  productions  in 
which  a  label  of  that  form  appears  on  the  right-hand  side  (column  3).  Also,  for  each  kind  of  label,  Table  3 
shows  how  many  productions  the  CFL-reachability  Algorithm  may  use  with  a  given  edge  with  that  kind  of 
label  (column  4),  and  how  many  new  edges  the  CFL-reachability  Algorithm  may  attempt  to  produce  as  a 
result of examining that edge (column 5). (The latter is the total for all the productions the CFL-reachability
Algorithm will examine.)

For example, consider the edge label Ground-Id. There may be O(n^2) edges labelled Ground-Id in the
graph. When the CFL-reachability Algorithm takes a given edge of the form Ground-Id(Vj, Vk) from its worklist,
it could potentially examine O(tr) = O(t) productions of the form Ground-Id-c_i^{-1} ::= Ground-Id c_i^{-1},
in which Ground-Id appears on the right-hand side. There is one production of this form for every position
of every different kind of constructor used in the set-constraint problem. When the algorithm considers
one of these productions, it will look for an edge of the form c_i^{-1}(Vk, Vm), in an attempt to add the edge
Ground-Id-c_i^{-1}(Vj, Vm). However, edges of the form c_i^{-1}(Vk, Vm) are introduced in the graph to encode
projection constraints; this means that their number is bounded by O(t). Thus, over all of the O(t) productions
of the form Ground-Id-c_i^{-1} ::= Ground-Id c_i^{-1}, the CFL-reachability Algorithm will find no more than O(t)
matching edges of the form c_i^{-1}(Vk, Vm), and so it will add no more than O(t) new edges as a result of
processing any given edge of the form Ground-Id(Vj, Vk).

The accounting is more straightforward in most other cases. Table 3 summarizes the results. A bound
on the amount of work performed is found by summing column 4 and column 5 and then multiplying by
column 2. Since r is constant, and v and n are in the worst case proportional to t, the total running time of
the algorithm is bounded by O(t^3).
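As a worked instance of this accounting, take the Ground-Id label discussed in this section and use only the bounds stated in the text: O(n^2) edges, O(tr) productions examined per edge, and O(t) attempted edge additions per edge. With r constant and n = O(t), the contribution of this one label is:

```latex
\underbrace{O(n^{2})}_{\text{edges}}
\times
\Bigl(\underbrace{O(tr)}_{\text{productions examined}}
      + \underbrace{O(t)}_{\text{edges attempted}}\Bigr)
\;=\; O(t^{2}) \cdot O(t)
\;=\; O(t^{3})
```

The overall bound follows when no other label contributes more than this, which is what the per-label bounds of Table 3 are meant to establish.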


5  Solving  ML  Set-Constraint  Problems  Using  CFL-reachability 

Heintze  has  used  a  modified  class  of  set  constraints  for  set-based  analysis  of  ML  programs  [17].  We  refer 
to  this  class  of  set  constraints  as  ML  set  constraints  to  distinguish  them  from  the  set  constraints  discussed 
in  the  earlier  part  of  the  paper.  (This  class  of  set  constraints  can  be  used  to  express  closure  analysis — 
the  problem  of  determining  the  set  of  abstractions  that  can  reach  an  application — and  hence  related  work 
includes  [35,  47,  45].)  In  this  section,  we  define  ML  set  constraints  and  then  show  how  to  encode  an  ML 
set-constraint  problem  as  a  CFL-reachability  problem. 

5.1  ML  Set  Constraints 

Similar to set expressions defined in Section 2.2.1, an ML set expression (se) may be a set variable or a
constructor of the form c(V1, ..., Vr). However, ML set expressions do not have explicit projections, but
instead may also have the following forms:

• case(Y1, c(W1, ..., Wr) ↦ Y2, W ↦ Y3). Expressions of this form are used to model case statements.
The values in Y1 are matched against the expression c(W1, ..., Wr). The presence of a ground term of
the form c(v1, ..., vr) in Y1 indicates that vi ∈ Wi for i = 1...r and that the values of the entire case
expression are a superset of the values in Y2. W represents the default branch of the case expression.
It contains a superset of the values in Y1 that are ground terms not of the form c(v1, ..., vr). The
presence of a ground value in Y1 that is not a c term indicates that the values of the entire case
expression are a superset of the values in Y3.

Note that the value decomposition feature of case expressions serves as a replacement for the projection
operators of the set expressions described in Section 2.2.1.

• “Abstraction constants” of the form λx.e. In program-analysis problems, such constants typically play
a role in modeling function abstractions: Each abstraction constant is manufactured from a function
abstraction in the program (e.g., the x and e in abstraction constant λx.e are derived in some fashion
from the textual definition of a function abstraction in the program). The e part of abstraction constant
λx.e serves as a tag to distinguish this abstraction constant from other abstraction constants of the
set-constraint problem. The x part of abstraction constant λx.e serves to link λx.e to two associated
set variables:

— Vx, which holds a superset of all the values that may bind to x during program execution.

— Range_λx.e, which represents the range of λx.e. It holds a superset of all the values that λx.e may
return during program execution.

In program-analysis problems, one would typically standardize names apart, so that each two different
abstraction constants λx.e and λy.e' of the set-constraint problem would use different variable names
(x, y, etc.).

• apply(se1, se2). Expressions of this form are used to model function application.

• ifnonempty(se1, se2). Expressions of this form do not directly correspond to any language construct.
They are used to make set-based analysis more accurate by preventing constraints that correspond to
certain infeasible execution configurations from contributing to the solution [17, 43].

ML set constraints are of the form V ⊇ se, where se is an ML set expression. A solution to a collection of
ML set constraints is a mapping from set variables to sets of values such that the constraints are satisfied.
In this case a “value” may be an abstraction λx.e as well as a ground term composed of constructors. Given
a mapping I from set variables to sets of values, the mapping can be extended to map set expressions to
sets of values as follows:

• I(c(V1, ..., Vr)) = {c(v1, ..., vr) | v1 ∈ I(V1), ..., vr ∈ I(Vr)}

• I(λx.e) = {λx.e}

• I(ifnonempty(se1, se2)) = if I(se1) = {} then {} else I(se2)

• I(apply(se1, se2)) = {v : λx.e ∈ I(se1) ∧ I(se2) ≠ {} ∧ v ∈ I(Range_λx.e)}
provided λx.e ∈ I(se1) implies I(se2) ⊆ I(Vx)

• I(case(se1, c(X1, ..., Xn) ↦ se2, Y ↦ se3)) = S1 ∪ S2, where

1. S1 = {v : v ∈ I(se2) ∧ ∃ c(v1, ..., vn) ∈ I(se1)}

2. S2 = {v : v ∈ I(se3) ∧ ∃ c'(v1, ..., vn) ∈ I(se1) s.t. c' ≠ c}

3. c(v1, ..., vn) ∈ I(se1) implies vi ∈ I(Xi), i = 1...n

4. For all c' ≠ c, c'(v1, ..., vn) ∈ I(se1) implies c'(v1, ..., vn) ∈ I(Y)

Note that it is possible for an expression to be undefined in a given mapping. This can happen if the
mapping I does not meet the requirements for interpreting the expression. A solution I to a collection of
constraints C must define each set expression used in C.

5.1.1 Solving ML Set Constraints

ML  set  constraints  with  the  following  form  are  said  to  be  in  explicit  form: 


V ⊇ Vi

V ⊇ c(V1, ..., Vr)

V ⊇ λx.e

As  before,  a  collection  of  ML  set  constraints  C  is  solved  by  augmenting  the  collection  with  constraints  in 
explicit  form  until  no  more  can  be  added.  The  constraints  in  explicit  form  can  then  be  taken  to  be  a  regular 
term  grammar  that  represents  the  least  solution  to  the  constraints.  The  ML-SC-Reduction  Algorithm  is 
defined  below.  Groundness  is  defined  as  in  Section  2.2.3. 

Algorithm 5.1 (ML-SC-Reduction Algorithm) Given a collection of ML set constraints C, the following
steps are repeated until no step causes C to change:

1. if X ⊇ apply(X1, X2) and X1 ⊇ λx.e both appear in C then

(a) add the constraint X ⊇ Range_λx.e to C

(b) add the constraint Vx ⊇ X2 to C

2. if X ⊇ case(Y1, c(W1, ..., Wr) ↦ Y2, W ↦ Y3) and Y1 ⊇ c(Z1, ..., Zr) both appear in C and the
expression c(Z1, ..., Zr) is ground then

(a) add the constraint X ⊇ Y2 to C

(b) for i = 1...r, add the constraint Wi ⊇ Zi to C

3. if X ⊇ case(Y1, c(W1, ..., Wr) ↦ Y2, W ↦ Y3) and Y1 ⊇ d(Z1, ..., Zr) both appear in C, where d ≠ c
and the expression d(Z1, ..., Zr) is ground, then

(a) add the constraint X ⊇ Y3 to C

(b) add the constraint W ⊇ d(Z1, ..., Zr) to C

4. if X ⊇ ifnonempty(Y1, Y2) appears in C and Y1 is ground then add X ⊇ Y2 to C

5. if X ⊇ X' and X' ⊇ se both appear in C, where X' ⊇ se is in explicit form, then add X ⊇ se to C

When no more constraints can be added, the constraints in explicit form are converted to a regular term
grammar; this describes the least solution [17]. □
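The five steps of the algorithm above can be sketched as a naive fixpoint loop in Python. Everything about the encoding is an assumption made for illustration: constraints are tuples, the variables associated with λx.e are named `V_x` and `Range_x.e`, and groundness is taken to mean that a variable has an explicit constraint whose right-hand side is an abstraction constant or a constructor with ground arguments (the definition in Section 2.2.3 is not reproduced here, so this reading should be checked against it). The loop is written for clarity and makes no attempt to meet any particular time bound.

```python
def ground_vars(C):
    """Variables X with an explicit constraint X ⊇ ae where ae is ground
    (assumed reading of groundness; abstraction constants count as ground)."""
    ground, changed = set(), True
    while changed:
        changed = False
        for con in C:
            kind, X = con[0], con[1]
            if X in ground:
                continue
            if kind == "lam" or (kind == "con" and all(a in ground for a in con[3])):
                ground.add(X)
                changed = True
    return ground


def ml_sc_reduce(constraints):
    """Assumed constraint encoding:
         ("sub", X, Y)                          X ⊇ Y
         ("con", X, c, args)                    X ⊇ c(args)
         ("lam", X, x, e)                       X ⊇ λx.e
         ("apply", X, X1, X2)                   X ⊇ apply(X1, X2)
         ("case", X, Y1, c, Ws, Y2, W, Y3)      X ⊇ case(Y1, c(Ws) ↦ Y2, W ↦ Y3)
         ("ifne", X, Y1, Y2)                    X ⊇ ifnonempty(Y1, Y2)
    """
    C = set(constraints)
    while True:
        ground = ground_vars(C)
        new = set()
        for con in C:
            kind = con[0]
            if kind == "apply":                                   # step 1
                _, X, X1, X2 = con
                for d in C:
                    if d[0] == "lam" and d[1] == X1:
                        new.add(("sub", X, f"Range_{d[2]}.{d[3]}"))
                        new.add(("sub", f"V_{d[2]}", X2))
            elif kind == "case":                                  # steps 2 and 3
                _, X, Y1, c, Ws, Y2, W, Y3 = con
                for d in C:
                    if d[0] == "con" and d[1] == Y1 and all(z in ground for z in d[3]):
                        if d[2] == c:
                            new.add(("sub", X, Y2))
                            new.update(("sub", w, z) for w, z in zip(Ws, d[3]))
                        else:
                            new.add(("sub", X, Y3))
                            new.add(("con", W, d[2], d[3]))
            elif kind == "ifne":                                  # step 4
                _, X, Y1, Y2 = con
                if Y1 in ground:
                    new.add(("sub", X, Y2))
            elif kind == "sub":                                   # step 5
                _, X, Xp = con
                for d in C:
                    if d[0] in ("sub", "con", "lam") and d[1] == Xp:
                        new.add((d[0], X) + d[2:])
        if new <= C:
            return C
        C |= new


# Hypothetical instance: F ⊇ λx.e, A ⊇ a, R ⊇ apply(F, A), Range_x.e ⊇ b.
result = ml_sc_reduce([
    ("lam", "F", "x", "e"),
    ("con", "A", "a", ()),
    ("apply", "R", "F", "A"),
    ("con", "Range_x.e", "b", ()),
])
```

In this instance step 1 links R to the range of λx.e and links Vx to the argument A, after which step 5 copies the explicit constraints across, so the result contains both R ⊇ b and V_x ⊇ a.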

5.2  Solving  ML  Set-Constraint  Problems  Using  CFL-reachability 

The idea for encoding an ML set-constraint problem is the same as in Section 4.1: we view the ML
SC-Reduction Algorithm as computing what atomic expressions reach each set variable and construct a
CFL-reachability problem that computes the same information. The constructed graph contains a node for each
atomic expression and a node for each set variable. Where the ML SC-Reduction Algorithm produces the
explicit constraint V ⊇ ae, the constructed CFL-reachability problem induces an identity path from the node
representing atomic expression ae to the node representing the set variable V.

In the rest of this section, we first describe how to construct a graph to encode a collection of ML
set constraints. Then we show what productions are used to encode the steps of the ML SC-Reduction
Algorithm for a given collection of ML set constraints. The technique for handling groundness information
in the problem constructed here is the same as in Section 4.1.2. As in Section 4.1.2, for every edge from
node i to node j, we need a corresponding reverse edge from node j to node i. To simplify the presentation,
we will not explicitly list the reverse edges (nor the productions that generate them), but we assume that
they are also produced.

5.2.1 Encoding ML Set Constraints

Given  a  collection  of  constraints  C,  the  graph  encoding  these  constraints  is  constructed  as  follows: 

•  For  each  set  variable  Vi,  the  graph  contains  a  node  labelled  Vi,  and  an  edge  edgeVitoVi(Vi,Vi) . 

•  Each  atomic  expression  c(Vi, . . . ,  Vr)  used  in  a  constraint  of  the  form  V  D  c(Vi, . . . ,  Vr)  is  associated 
with  a  unique  index. 

Given  an  expression  c(Vi, . . . ,  Vr)  with  index  k,  the  graph  contains  a  node  labelled  (k),  and  the  edges 
ae{(k),  (k))  and  c-value((k),  ( k )).  The  edge  c-value((k),  (k))  indicates  that  the  node  (k)  can  represent 
a  c  value.  (The  node  (k)  actually  represents  a  c- value  iff  c(V ±, . . . ,  Vr)  is  ground;  thus  it  is  really  the 
presence  of  a  pair  of  edges  c-value((k),  (k))  and  Ground((k),  (k))  that  indicates  that  (k)  is  known  to 
be  a  ground  c  term.)  If  c  is  a  nullary  constructor,  then  the  graph  contains  the  edge  Ground((k),  ( k )). 
Otherwise,  for  each  position  j  of  the  atomic  expression  (where  j  =  1  ...r),  the  graph  contains  the 
edges  Cj{Vj,(k)),  edge(k)toVj((k),Vj),  and  edgeVjto(k){Vj,(k)). 

•  For  each  constraint  of  the  form  V  D  c(V i, . . . ,  Vr),  where  the  expression  c(Vi, . . . ,  Vr)  has  index  k,  the 
graph  contains  an  edge  Id{(k),V)  indicating  that  the  atomic  expression  c(Vi, . . . ,  Vr)  reaches  V. 

•  For each atomic expression λx.e, the graph contains a node labelled λx.e and an edge Ground(λx.e, λx.e).
The node λx.e is connected to the nodes representing the set variables Vx and Range_{λx.e} by the edges
input⁻(λx.e, Vx) and return(Range_{λx.e}, λx.e). (See Figure 12(a).) An edge input⁻(λx.e, Vx) indicates
that values to which the abstraction λx.e is applied reach the variable Vx. An edge return(Range_{λx.e}, λx.e)
indicates that Range_{λx.e} holds a superset of the values returned by the abstraction λx.e during program
execution.

•  For each constraint of the form V ⊇ apply(V1, V2), the graph contains the following edges:

—  return⁻(V1, V). This edge indicates that V contains values that are returned by abstractions in
V1.

—  input(V2, V1). This edge indicates that values in V2 are potential arguments of abstractions in V1.
See Figure 12(b).

•  For each constraint of the form V ⊇ λx.e, the graph contains an edge Id(λx.e, V).

•  For each constraint of the form Vi ⊇ Vj, the graph contains an edge Id(Vj, Vi).

•  For each constraint of the form X ⊇ case(Y1, c(W1, ..., Wr) ↦ Y2, W ↦ Y3), the graph contains the
following edges:

1. c_i^{-1}(Y1, Wi) where i = 1 ... r

2. non-c-vals(Y1, W)

3. edgeY2toY1(Y2, Y1)

4. edgeY3toY1(Y3, Y1)

5. edgeY1toX(Y1, X)

Figure 12(c) illustrates point 1 above, and Figure 12(d) illustrates points 2 through 5.

•  For each constraint of the form X ⊇ ifnonempty(Y1, Y2), the graph contains the edges edgeY2toY1(Y2, Y1)
and edgeY1toX(Y1, X).
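Part of the construction above can be sketched in code. The following Python fragment is only an illustration, covering three of the constraint forms (Vi ⊇ Vj, apply, and ifnonempty) plus the per-variable self-loops; the tuple encoding of constraints, the function name encode_constraints, and the ASCII label "return-" are assumptions made for this sketch, not notation from the paper.

```python
def encode_constraints(variables, constraints):
    """Build labelled edges (label, source, target) for a few constraint forms.

    Hypothetical constraint encoding used only in this sketch:
      ("superset", Vi, Vj)        stands for  Vi >= Vj
      ("apply", V, V1, V2)        stands for  V >= apply(V1, V2)
      ("ifnonempty", X, Y1, Y2)   stands for  X >= ifnonempty(Y1, Y2)
    """
    edges = set()
    for v in variables:
        # Each set variable V gets a self-loop labelled edgeVtoV.
        edges.add((f"edge{v}to{v}", v, v))
    for c in constraints:
        kind = c[0]
        if kind == "superset":
            _, vi, vj = c
            edges.add(("Id", vj, vi))          # Id(Vj, Vi)
        elif kind == "apply":
            _, v, v1, v2 = c
            edges.add(("return-", v1, v))      # return-(V1, V)
            edges.add(("input", v2, v1))       # input(V2, V1)
        elif kind == "ifnonempty":
            _, x, y1, y2 = c
            edges.add((f"edge{y2}to{y1}", y2, y1))
            edges.add((f"edge{y1}to{x}", y1, x))
        else:
            raise ValueError(f"constraint form not covered by this sketch: {kind}")
    return edges
```

The remaining constraint forms (atomic expressions, abstractions, and case) would add their edges in the same style.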

5.2.2  Encoding  the  ML  SC-Reduction  Algorithm 

The  productions  used  to  encode  the  ML  SC-Reduction  Algorithm  are  a  superset  of  the  productions  used 
to  encode  the  SC-Reduction  Algorithm.  The  productions  introduced  in  Section  4.1.2  are  again  used  to 
propagate groundness information. For each node representing a variable Vi, there is a production

Ground ::= edgeVitoVi Rev_Id Ground Id edgeVitoVi

For each atomic expression of the form c(V1, ..., Vr) with index k, the context-free grammar contains the
following production:

Ground ::= edge(k)toV1 Ground edgeV1to(k)
           edge(k)toV2 Ground edgeV2to(k)
           ...
           edge(k)toVr Ground edgeVrto(k)

In  Section  3,  the  production  Id  ::=  Ground  ae  Id  Id  encodes  Step  2  of  the  SC-Reduction  Algorithm; 
now  we  use  it  to  encode  Step  5  of  the  ML  SC-Reduction  Algorithm.  Similarly,  productions  of  the  form 

Id ::= c_i Ground Id c_i^{-1}

were  used  earlier  to  encode  Step  1  of  the  SC-Reduction  Algorithm;  now  they  encode  the  actions  taken  by 
Step  2(b)  of  the  ML  SC-Reduction  Algorithm.  (See  Figure  14(a).) 

New  productions  are  needed  to  encode  Steps  1,  2(a),  3,  and  4  of  the  ML  SC-Reduction  Algorithm.  In 
the  following  examples,  we  introduce  the  productions  used  to  encode  these  steps. 

Example  5.1  Consider  the  following  constraints: 

X ⊇ apply(V1, V2)

V1 ⊇ λx.e

Given these constraints, Step 1(a) of the ML SC-Reduction Algorithm introduces X ⊇ Range_{λx.e}. This
constraint is added because the result of the apply expression includes values in the range of any abstraction
that reaches V1. In the constructed graph, the presence of the edge Id(λx.e, V1) indicates that the
abstraction λx.e reaches V1. To simulate the actions of Step 1(a), the CFL-reachability Algorithm uses the edges
Id(λx.e, V1) and return⁻(V1, X), and the production Id ::= return Id return⁻. (See Figure 13(a).)

Given the above constraints, Step 1(b) of the ML SC-Reduction Algorithm introduces the constraint
Vx ⊇ V2: the semantics of the apply expression demand that for any abstraction λx.e that reaches V1, the
values in V2 should reach Vx. This is simulated in the CFL-reachability Algorithm by the edges input(V2, V1),
Rev_Id(V1, λx.e), and input⁻(λx.e, Vx) and the production Id ::= input Rev_Id input⁻. (See Figure 13(b).)

□ 
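The edge inductions of Example 5.1 can be reproduced with a naive fixpoint computation. The following Python sketch is not the worklist algorithm of Section 2.1.1; it simply applies binary productions until no new edge appears, using the normalized forms of the two productions from the example. The node names Range_lam, lam, and Vx, and the ASCII labels "return-" and "input-", are stand-ins chosen for this illustration.

```python
from collections import defaultdict


def cfl_closure(edges, productions):
    """Close a set of (label, src, dst) edges under productions lhs ::= b c."""
    result = set(edges)
    changed = True
    while changed:
        changed = False
        # Index the current edges by (label, source) for the second symbol.
        targets = defaultdict(set)
        for label, u, v in result:
            targets[(label, u)].add(v)
        for lhs, (b, c) in productions:
            for label, u, v in list(result):
                if label != b:
                    continue
                # Pair b(u, v) with every c(v, w) to induce lhs(u, w).
                for w in targets.get((c, v), ()):
                    if (lhs, u, w) not in result:
                        result.add((lhs, u, w))
                        changed = True
    return result


# Graph for X >= apply(V1, V2) and V1 >= lambda x.e (names are illustrative).
example_edges = {
    ("return", "Range_lam", "lam"),
    ("Id", "lam", "V1"),
    ("Rev_Id", "V1", "lam"),
    ("return-", "V1", "X"),
    ("input", "V2", "V1"),
    ("input-", "lam", "Vx"),
}

# Normalized forms of Id ::= return Id return-  and  Id ::= input Rev_Id input-.
example_productions = [
    ("Return-Id", ("return", "Id")),
    ("Id", ("Return-Id", "return-")),
    ("Input-Rev_Id", ("input", "Rev_Id")),
    ("Id", ("Input-Rev_Id", "input-")),
]

closed = cfl_closure(example_edges, example_productions)
```

In the closed graph, the edge Id(Range_lam, X) corresponds to the constraint introduced by Step 1(a), and Id(V2, Vx) to the constraint introduced by Step 1(b).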


In  the  following  example,  we  introduce  productions  for  encoding  Step  2(a)  and  Step  3  of  the  ML  SC- 
Reduction  Algorithm. 

Example  5.2  Consider  the  following  constraints: 
Figure 13: Graphs showing edge induction for constraints of the form
X ⊇ apply(V1, V2) and V1 ⊇ λx.e.


1. X ⊇ case(Y1, cons(W1, W2) ↦ Y2, W ↦ Y3)

2. Y1 ⊇ cons(V1, V2)

3. Y1 ⊇ succ(Z1)

Let cons(V1, V2) and succ(Z1) have indices j and k, respectively, and suppose both expressions are ground.
Figure 14 shows the graph constructed to represent the above constraints (and many subgraphs of this
graph). The features of this graph are explained below.

Step 2(a) of the ML SC-Reduction Algorithm introduces the constraint X ⊇ Y2 iff a ground cons
expression reaches Y1: in this example, X ⊇ Y2 is introduced because of the constraint Y1 ⊇ cons(V1, V2)
and the assumption that cons(V1, V2) is ground. In the constructed graph, a node (m) represents a ground
cons expression iff the graph contains both of the edges Ground((m), (m)) and cons-value((m), (m)). To
encode the actions of Step 2(a) on the above case constraint, the constructed graph contains the edges
edgeY2toY1(Y2, Y1) and edgeY1toX(Y1, X) and the grammar contains the following production:

Id ::= edgeY2toY1 Rev_Id cons-value Ground cons-value Id edgeY1toX

(See Figure 14(b).) (Note that this production has two occurrences of the terminal symbol cons-value
in order to limit the possible blow-up in the running time required to solve the constructed
CFL-reachability problem. This feature is explained in Section 5.3. The production is still correct if either
of these terminals is removed.)

Given the above constraints, Step 3(a) of the ML SC-Reduction Algorithm introduces the constraint X ⊇ Y3
iff a ground expression of the form c(V1, ..., Vr) reaches Y1, where c ≠ cons; in this case, the constraint
Y1 ⊇ succ(Z1) and the assumption that succ(Z1) is ground mean that X ⊇ Y3 is generated. In the
constructed graph, the edges Ground((m), (m)) and c-value((m), (m)) indicate that node (m) represents a
ground expression of the form c(V1, ..., Vr). To encode the actions of Step 3(a) on the above case constraint,
the graph contains the edges edgeY3toY1(Y3, Y1) and edgeY1toX(Y1, X) and the grammar contains the
following productions:

Id ::= edgeY3toY1 Rev_Id c-value Ground c-value Id edgeY1toX

for each constructor c such that c ≠ cons. (See Figure 14(c).) Note that there is one production of this form
for each constructor type for each case constraint. This means that the construction is no longer linear in
time, but its running time is bounded by O(t^3). (See Section 5.3.)

Step 3(b) of the ML SC-Reduction Algorithm allows ground atomic expressions of the form c(V1, ..., Vr)
to pass from Y1 to W iff c ≠ cons. In the current example, the constraint W ⊇ succ(Z1) is introduced. To
encode Step 3(b), we use the edge non-cons-vals(Y1, W), and productions of the form

Id ::= Ground c-value Id non-cons-vals

for each constructor type c such that c ≠ cons. (See Figure 14(d).) □

Finally, we must encode the action taken by Step 4 of the ML SC-Reduction Algorithm on a constraint
of the form X ⊇ ifnonempty(Y1, Y2). This is done using the edges edgeY2toY1 and edgeY1toX and the
following production:

Id ::= edgeY2toY1 Ground edgeY1toX

As in Section 4, the regular term grammar that is the solution to the ML set-constraint problem can be
obtained from the solution to the constructed CFL-reachability problem by examining Id edges. For each
Id edge from a node representing an atomic expression ae to a node representing a variable V, the regular
term grammar contains a production of the form V => ae.

5.3  Cost  of  Solving  the  Constructed  CFL-Reachability  Problem 

As  with  the  construction  described  in  Section  4.4,  when  we  plug  the  various  parameters  that  characterize  the 
size  of  the  constructed  CFL-reachability  problem  into  the  standard  formula  for  the  worst-case  asymptotic 
running  time  of  CFL-reachability,  we  have  not  preserved  the  0(t3)  bound  on  the  time  to  solve  ML  set- 
constraint  problems.  In  this  section,  by  an  argument  similar  to  that  used  in  Section  4.4,  we  show  that  the 
constructed  CFL-reachability  problem  can  indeed  be  solved  in  0(t3). 

Below, we first discuss why it is necessary to repeat terminal symbols in some of the productions presented
in Section 5.2. In Section 5.3.2, we list the normalizations of the productions that are new to Section 5.
Finally,  Table  4  summarizes  the  work  done  for  each  edge  added  by  the  CFL-reachability  Algorithm  while 
solving  a  problem  constructed  from  an  ML  set-constraint  problem. 

5.3.1  Repeating  Terminal  Symbols 

In Section 5.2, we introduced some productions that have seemingly unnecessary repetitions of some terminal
symbols. In particular, a production of the form

Id ::= edgeY3toY1 Rev_Id c-value Ground c-value Id edgeY1toX

causes the CFL-reachability Algorithm to induce an Id edge exactly when the production

Id ::= edgeY3toY1 Rev_Id Ground c-value Id edgeY1toX

causes the CFL-reachability Algorithm to induce an Id edge. This follows from the fact that the labels c-value
and Ground always appear on cyclic edges. However, while the productions are functionally equivalent,
every normalization of the latter production either introduces a non-terminal that might label O(tn^2) edges
and participate in O(t) productions, or introduces a non-terminal that might appear on O(tn) edges and
participate in O(t^2) productions. Either way, the bound on the running time of the CFL-reachability
Algorithm increases to O(t^4).

Adding  the  second  c-value  allows  us  to  find  a  normalization  that  avoids  this  blowup.  To  see  why,  let  us 
examine  in  more  detail  what  goes  wrong  when  the  production 
Figure 15: Graph showing the need to double terminals in some
productions. The bold edges are used by the normalized production
Id ::= EdgeY2toY1-Rev_Id Ground-c-Id-edgeY1toX; note that the edge
EdgeY2toY1-Rev_Id(Y3, (k)) may be used in O(t^2) productions of this form.
However, suppose the edge EdgeY2toY1-Rev_Id(Y3, (k)) is first paired with
the edge c-value((k), (k)) to generate the edge EdgeY2toY1-Rev_Id-c(Y3, (k)).
An edge of this form can be used with only O(t) productions of the form
Id ::= EdgeY2toY1-Rev_Id-c Ground-c-Id-edgeY1toX to generate the same Id
edges.

Id ::= edgeY2toY1 Rev_Id Ground c-value Id edgeY1toX
is normalized to the productions:

1. Ground-c ::= Ground c-value

2. Ground-c-Id ::= Ground-c Id

3. Ground-c-Id-edgeY1toX ::= Ground-c-Id edgeY1toX

4. EdgeY2toY1-Rev_Id ::= edgeY2toY1 Rev_Id

5. Id ::= EdgeY2toY1-Rev_Id Ground-c-Id-edgeY1toX

Notice that there are O(t) productions of the form of the fifth production for each of O(t) different constructor
types. The problem with this normalization is with the non-terminal EdgeY2toY1-Rev_Id and the fifth
production. There may be O(tn) edges labelled with this non-terminal, each involved in O(t^2) productions
of the form of the fifth production above. Consider a particular edge EdgeY2toY1-Rev_Id(i, j) that has
node j as its target. There can be at most one edge of the form d-value(j, j) for at most one constructor
type d. For all edges labelled Ground-d'-Id-edgeY1toX that leave node j, it must be the case that d =
d'. This means that there can be a maximum of O(t) edges that leave node j and have a label of the
form Ground-d'-Id-edgeY1toX. This implies that when the CFL-reachability Algorithm examines the edge
EdgeY2toY1-Rev_Id(i, j) and looks at O(t^2) productions, all but O(t) of these productions cause the
CFL-reachability Algorithm to search for a second edge that cannot exist.

In  contrast,  the  production 

Id ::= edgeY2toY1 Rev_Id c-value Ground c-value Id edgeY1toX
can be normalized to

1. Ground-c ::= Ground c-value

2. Ground-c-Id ::= Ground-c Id

3. Ground-c-Id-edgeY1toX ::= Ground-c-Id edgeY1toX

4. EdgeY2toY1-Rev_Id ::= edgeY2toY1 Rev_Id

5. EdgeY2toY1-Rev_Id-c ::= EdgeY2toY1-Rev_Id c-value

6. Id ::= EdgeY2toY1-Rev_Id-c Ground-c-Id-edgeY1toX

In this normalization, the non-terminal EdgeY2toY1-Rev_Id-c may appear on O(t^2) edges, but it participates
in only O(t) productions of the sixth form. In effect, production five, of which there are only O(t) for a
given edge of the form EdgeY2toY1-Rev_Id(i, j), forces the CFL-reachability Algorithm to determine what
constructor type, if any, is represented at node j before it starts to consider productions that include the
non-terminal Ground-c-Id-edgeY1toX. See Figure 15.

5.3.2  Normalization  of  the  Constructed  Grammar 

Normalization  of  the  context-free  grammar  in  a  constructed  problem  is  done  as  in  Section  4.4.1.  In  fact, 
since  the  productions  used  to  encode  the  ML  SC-Reduction  Algorithm  are  a  superset  of  the  productions  used 
to  encode  the  SC-Reduction  Algorithm,  all  of  the  normalizations  from  Section  4.4.1  are  needed  for  a  CFL- 
reachability  problem  constructed  from  an  ML  set-constraint  problem;  these  normalizations  are  not  repeated 
here.  We  also  do  not  show  the  normalization  of  “reverse”  productions  that  have  RevJd  on  their  left-hand 
side;  the  normalization  of  a  reverse  production  is  the  reverse  of  the  normalization  for  the  corresponding 
forward  production. 

The  normalizations  of  the  productions  new  to  this  section  are  as  follows: 

•  Id ::= input Rev_Id input⁻ is normalized to

Input-Rev_Id ::= input Rev_Id
Id ::= Input-Rev_Id input⁻

•  Id ::= return Id return⁻ is normalized to

Return-Id ::= return Id
Id ::= Return-Id return⁻

•  Id ::= c_i Ground Id c_i^{-1} is normalized to

Ground-Id ::= Ground Id
c_i-Ground-Id ::= c_i Ground-Id
Id ::= c_i-Ground-Id c_i^{-1}

•  Id ::= edgeY2toY1 Rev_Id c-value Ground c-value Id edgeY1toX is normalized to

Ground-c ::= Ground c-value
Ground-c-Id ::= Ground-c Id
Ground-c-Id-edgeY1toX ::= Ground-c-Id edgeY1toX
EdgeY2toY1-Rev_Id ::= edgeY2toY1 Rev_Id
EdgeY2toY1-Rev_Id-c ::= EdgeY2toY1-Rev_Id c-value
Id ::= EdgeY2toY1-Rev_Id-c Ground-c-Id-edgeY1toX

•  Id ::= Ground c'-value Id non-c-vals is normalized to

Ground-c' ::= Ground c'-value
Ground-c'-Id ::= Ground-c' Id
Id ::= Ground-c'-Id non-c-vals

•  Id ::= edgeY2toY1 Ground edgeY1toX is normalized to

Ground-edgeY1toX ::= Ground edgeY1toX
Id ::= edgeY2toY1 Ground-edgeY1toX

Table  4  together  with  Table  3  lists  the  costs  entailed  by  the  processing  steps  of  the  algorithm  for  solving 
CFL-reachability  problems  from  Section  2.1.1.  A  bound  on  the  amount  of  work  performed  is  found  by 
summing  column  4  and  column  5  and  then  multiplying  by  column  2.  Since  r  is  constant,  and  v,  k,  and  n 
are  in  the  worst  case  proportional  to  t,  the  total  running  time  of  the  algorithm  is  bounded  by  0(t3). 

6  Solving  CFL-Reachability  Problems  Using  ML  Set  Constraints 

In  this  section,  we  discuss  how  ML  set  constraints  can  be  used  to  solve  CFL-reachability  problems.  First 
note  that  a  projection  constraint  of  the  form 

U ⊇ c_j^{-1}(V)

from the class of set constraints presented in Section 2.2 can be modelled by the ML set constraint

U ⊇ case(V, c(T1, ..., Tj, ..., Tr) ↦ Tj, X ↦ Y)

where T1 ... Tr, X, and Y are new variables. Note that to have the same semantics as the projection
constraint, it is important that Y map to the empty set; otherwise, values from Y may reach U, which is
not part of the semantics of the projection constraint.
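The translation from a projection constraint to a case constraint is mechanical, and a small sketch may make the role of the fresh variables concrete. The Python fragment below is an illustration only: the CaseConstraint record, the fresh-name generator, and the function projection_to_case are hypothetical names introduced here, and the code assumes that no other constraint ever mentions the fresh variable Y, so that Y denotes the empty set in the least solution.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class CaseConstraint:
    """U >= case(V, c(T1, ..., Tr) |-> Tj, X |-> Y), stored field by field."""
    lhs: str               # U
    scrutinee: str         # V
    constructor: str       # c
    pattern_vars: tuple    # (T1, ..., Tr)
    match_result: str      # Tj
    default_var: str       # X
    default_result: str    # Y


_counter = count()


def fresh(prefix):
    """Return a variable name that has not been handed out before."""
    return f"{prefix}_{next(_counter)}"


def projection_to_case(u, constructor, arity, j, v):
    """Model the projection constraint U >= c_j^{-1}(V) by a case constraint."""
    if not 1 <= j <= arity:
        raise ValueError("field index j must lie between 1 and the arity")
    pattern_vars = tuple(fresh("T") for _ in range(arity))
    default_var = fresh("X")
    # Y is fresh and must stay unconstrained so that it denotes the empty set.
    default_result = fresh("Y")
    return CaseConstraint(u, v, constructor, pattern_vars,
                          pattern_vars[j - 1], default_var, default_result)
```

For example, projection_to_case("U", "cons", 2, 1, "V") yields a case constraint whose matching branch returns the first pattern variable.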

By  replacing  projections  with  case  expressions  in  this  fashion,  the  construction  in  Section  3  becomes  a 
transformation  from  CFL-reachability  problems  to  ML  set-constraint  problems.  The  run  time  for  an  ML 
set-constraint  problem  constructed  in  this  way  has  a  higher  constant  of  proportionality  than  a  constructed 
set-constraint problem from Section 3, although the asymptotic run time is the same. In particular, in the
construction from Section 3, a constraint of the form Rchd[B,i] ⊇ C(V) may pair with at most O(s)
projection constraints of the form Dst[A,i] ⊇ C^{-1}(Rchd[B,i]), where s is the number of symbols of the
context-free grammar of the original CFL-reachability problem.

In a constructed ML set-constraint problem, a constraint of the form Rchd[B,i] ⊇ C(V) may match at
most O(s) case constraints of the following form:

Dst[A,i] ⊇ case(Rchd[B,i], C(T) ↦ T, X ↦ Y)

However, the constraint Rchd[B,i] ⊇ C(V) may also match the “default” case of as many as O(s^2) case
constraints of the form

Dst[A,i] ⊇ case(Rchd[B,i], D(T) ↦ T, X ↦ Y)

where D ≠ C. This means that the time needed to solve a constructed ML set-constraint problem may be
O(s^4 n^3), where n is the number of nodes in the original CFL-reachability problem. (The time needed to
solve a constructed set-constraint problem from Section 3 is O(s^3 n^3).) Since s is a constant independent of
the input, the total run time is still bounded by O(n^3).

Of  course,  it  is  also  possible  to  optimize  ML  set  constraints  to  allow  “don’t  care”  defaults  that  will  not 
match  anything.  If  this  is  done,  the  runtime  for  a  constructed  ML  set-constraint  problem  is  the  same  as  the 
runtime  for  a  constructed  set-constraint  problem. 

7  Related  Work  and  Concluding  Remarks 

7.1  Broader  Classes  of  Set  Constraints 

This paper has presented interconvertibility results for context-free reachability problems and two classes
of set constraints. However, the problem of satisfiability for some classes of set constraints is
NEXPTIME-complete [49, 7]. Since CFL-reachability is PTIME-complete [1, 38, 48], it is impossible to use
CFL-reachability to cover these classes of set constraints (and it is unclear whether one can develop more
powerful graph-reachability techniques that would handle them). It is also not clear that CFL-reachability
can be used to model classes of set constraints in which intersection or negation is allowed.

7.1.1  Contravariant  Set  Constraints 

We  now  sketch  how  the  construction  given  in  Section  4  can  be  modified  to  handle  constructors  that  have 
contravariant  fields.  In  a  class  of  set  constraints  that  uses  contravariance,  each  constructor  has  a  signature 
that  indicates  whether  each  field  of  the  constructor  is  contravariant  or  covariant.  In  place  of  projections, 
there  is  a  reduction  rule  that  reduces  a  constraint  of  the  form 

c(U1, ..., Ur) ⊇ c(V1, ..., Vr)

to the following constraints:

Ui ⊇ Vi for all i such that c is covariant in field i
Vj ⊇ Uj for all j such that c is contravariant in field j

(Note that in the class of set constraints discussed in this paper, constraints of the form c(U1, ..., Ur) ⊇
c(V1, ..., Vr) are not permitted; a system that uses contravariant constraints should allow constraints of this
form, and might also allow constraints of the form c(U1, ..., Ur) ⊇ X.)

Contravariance  can  be  modeled  by  CFL-reachability  by  including  the  following  elements  in  the  construc¬ 
tion  of  the  CFL-reachability  problem: 

•  Each atomic expression c(V1, ..., Vr) used in the constraints is associated with a unique index. As in
Sections 4 and 5, we refer to an atomic expression by its index rather than by writing out the
expression.

For each atomic expression c(V1, ..., Vr) with index k, the graph contains a node labeled (k) and
the following edges:

c_i(Vi, (k)) for all i such that c is covariant in field i

c_i^{-1}((k), Vi) for all i such that c is covariant in field i

contra_c_j(Vj, (k)) for all j such that c is contravariant in field j

contra_c_j^{-1}((k), Vj) for all j such that c is contravariant in field j

In addition, for each of the above edges, the graph contains the corresponding reverse edge. For example,
if the graph contains the edge c_i(Vi, (k)), then the graph also contains rev_c_i((k), Vi). (Depending
on the constraint system being modeled and the other aspects of the constructed CFL-reachability
problem, some of these reverse edges may be unnecessary. For example, it may be possible to use the
edge c_i^{-1}((k), Vi) in place of the edge rev_c_i((k), Vi).)

•  For any constraint of the form c(U1, ..., Ur) ⊇ c(V1, ..., Vr), where the expression c(U1, ..., Ur) has
index j and the expression c(V1, ..., Vr) has index k, the graph contains the edges Id((k), (j)) and
Rev_Id((j), (k)).

•  For  each  constructor  c,  the  grammar  contains  the  following  productions: 

Id ::= c_i Id c_i^{-1} for all i such that c is covariant in field i.

Id ::= contra_c_j Rev_Id contra_c_j^{-1} for all j such that c is contravariant in field j.

In  addition,  the  grammar  should  contain  the  corresponding  “reverse”  productions  that  have  Rev-Id  on 
their  left-hand  side. 

Example  7.1  Let  the  binary  constructor  abs  be  contravariant  in  its  first  field,  and  covariant  in  its  second 
field. 

The constructor abs can be used in set expressions to represent a functional abstraction λx.e (in the
program that is being analyzed): let the set variable X represent (a superset of) the values that the program
variable x may bind to at runtime and let the set variable Range_{λx.e} represent (a superset of) the values
that are returned by λx.e during program execution. Then we use the set expression abs(X, Range_{λx.e}) to
represent the functional abstraction λx.e.

To represent the application (λx.e)(y), we use the set variables Y and App and the set constraint
abs(Y, App) ⊇ abs(X, Range_{λx.e}). Here, the set expression abs(X, Range_{λx.e}) represents λx.e, the set
variable Y represents (a superset of) the values y may bind to, and the set variable App represents a superset
of the values that the expression (λx.e)(y) may return. Recall that the constraint

abs(Y, App) ⊇ abs(X, Range_{λx.e})

reduces to the following constraints:

X ⊇ Y (which indicates that the values that y may take are bound to x as a result of the
application (λx.e)(y));

App ⊇ Range_{λx.e} (which indicates that the set of values that (λx.e)(y) evaluates to is a superset
of the values returned by λx.e).

Now let us consider the CFL-reachability problem constructed to represent the constraint abs(Y, App) ⊇
abs(X, Range_{λx.e}). The graph constructed to represent abs(Y, App) ⊇ abs(X, Range_{λx.e}) contains the
edges contra_abs_1(Y, (j)), Rev_Id((j), (k)), and contra_abs_1^{-1}((k), X) (where j is the index of the
expression abs(Y, App) and k is the index of the expression abs(X, Range_{λx.e})). These edges, together with the
production

Id ::= contra_abs_1 Rev_Id contra_abs_1^{-1}

cause the CFL-reachability Algorithm to add the edge Id(Y, X), which encodes the constraint X ⊇ Y.

The constructed graph also contains the edges abs_2(Range_{λx.e}, (k)), Id((k), (j)), and abs_2^{-1}((j), App).
These edges, together with the production

Id ::= abs_2 Id abs_2^{-1}

cause the CFL-reachability Algorithm to add the edge Id(Range_{λx.e}, App), which encodes the constraint
App ⊇ Range_{λx.e}.

Thus the constructed CFL-reachability problem correctly captures the effects of reducing the constraint
abs(Y, App) ⊇ abs(X, Range_{λx.e}). □
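The reduction rule for constraints between two applications of the same constructor can be written down directly. The following Python sketch is illustrative only: the function name reduce_constructor_constraint, the "co"/"contra" signature encoding, and the variable name Range_lam are assumptions made for this example. Each returned pair (A, B) stands for the constraint A ⊇ B.

```python
def reduce_constructor_constraint(variance, upper_args, lower_args):
    """Reduce c(U1, ..., Ur) >= c(V1, ..., Vr) to constraints on the fields.

    variance   -- per-field signature of c: "co" or "contra" (assumed encoding)
    upper_args -- (U1, ..., Ur), the fields on the superset side
    lower_args -- (V1, ..., Vr), the fields on the subset side
    Returns a list of pairs (A, B), each meaning A >= B.
    """
    if not (len(variance) == len(upper_args) == len(lower_args)):
        raise ValueError("signature and argument lists must have equal length")
    reduced = []
    for kind, u, v in zip(variance, upper_args, lower_args):
        if kind == "co":
            reduced.append((u, v))      # Ui >= Vi
        elif kind == "contra":
            reduced.append((v, u))      # Vj >= Uj
        else:
            raise ValueError(f"unknown variance: {kind}")
    return reduced


# abs is contravariant in its first field and covariant in its second.
abs_signature = ("contra", "co")
example = reduce_constructor_constraint(
    abs_signature, ("Y", "App"), ("X", "Range_lam"))
```

For the constraint of Example 7.1, the result is the pair of constraints X ⊇ Y and App ⊇ Range_lam, matching the Id edges induced in the constructed graph.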

7.2  Insight  Into  the  Cubic-Time  Bottleneck  for  Program  Analysis 

As  pointed  out  in  the  Introduction,  the  results  presented  in  this  paper  offer  some  insight  into  the  source  of  the 
cubic-time  bottleneck  for  program  analysis  problems.  Heintze  and  McAllester  have  also  obtained  results  that 
have  a  bearing  on  this  issue  by  considering  the  problem  of  determining  membership  for  languages  defined  by 
2-way  nondeterministic  pushdown  automata  (2NPDA-recognition)  [21].  The  asymptotically  best  algorithm 
known  for  solving  the  2NPDA-recognition  problem  runs  in  0(n3)  time,  and  they  observe  that  if  there  is 
a  linear-time  reduction  from  2NPDA-recognition  to  a  given  problem,  then  that  problem  is  unlikely  to  be 
solvable  in  better  than  0(n3)  time.  In  [21]  reductions  are  given  from  2NPDA-recognition  to  problems  of 
flow  analysis  and  typability  in  the  Amadio-Cardelli  type  system.  (This  is  consistent  with  something  we  had 
observed  in  unpublished  work,  where  we  gave  a  linear-time  reduction  from  the  2NPDA-recognition  problem 
to  CFL-reachability.)  Heintze  and  McAllester  have  also  examined  the  complexity  of  set-based  analysis  with 
data  constructors  [33,  20] . 

7.3  Applications  of  CFL-reachability 

Dolev, Even, and Karp used CFL-reachability to devise a formal model for studying the vulnerability
of a class of two-party (“ping-pong”) protocols in distributed systems to intrusion
by a third party [11]. In particular, they reduce the security-validation problem to a (single-source/single-target)
CFL-reachability problem in which labeled edges represent possible encoding and decoding operations
and  the  context-free  language  captures  the  interactions  between  possible  actions  that  can  take  place  during 
the  protocol. 

Yannakakis  surveys  the  literature  up  to  1990  on  applications  of  graph-theoretic  methods  in  database 
theory  [51].  He  discusses  many  types  of  graph-reachability  problems,  including  CFL-reachability. 
A variety of work has applied graph reachability (of various forms) to the analysis of imperative
programs. Kou [32] and Hecht [15] gave linear-time graph-reachability algorithms for solving intraprocedural
“bit-vector” dataflow-analysis problems. This approach was later applied to intraprocedural bi-directional
bit-vector problems [31]. Cooper and Kennedy used reachability to give efficient algorithms for
interprocedural side-effect analysis [9] and alias analysis [10].

The  first  uses  of  CFL-reachability  for  program  analysis  were  in  1988,  in  Callahan’s  work  on  flow-sensitive 
side-effect  analysis  [8]  and  Horwitz,  Reps,  and  Binkley’s  work  on  interprocedural  slicing  [22,  23].  Both  papers 
use  only  limited  forms  of  CFL-reachability,  namely  various  kinds  of  matched-parenthesis  (Dyck)  languages, 
and  neither  paper  relates  the  work  to  the  more  general  concept  of  CFL-reachability.  (Dyck  languages  had 
been  used  in  earlier  work  on  interprocedural  dataflow  analysis  by  Sharir  and  Pnueli  to  specify  that  the 
contributions  of  certain  kinds  of  nonexecutable  paths  should  be  filtered  out  [46];  however,  the  dataflow- 
analysis  algorithms  given  by  Sharir  and  Pnueli  are  based  on  machinery  other  than  pure  graph  reachability.) 

Dyck-language  reachability  was  shown  by  Reps,  Sagiv,  and  Horwitz  to  be  of  utility  for  a  wide  variety  of 
interprocedural  program-analysis  problems  [41].  These  ideas  were  elaborated  on  in  a  sequence  of  papers  [25, 
24,  40],  and  also  applied  to  shape  analysis  of  functional  programs  [37].  (See  also  [39]  for  a  survey  of  this 
work.) 

The  second  author  became  aware  of  the  connection  between  program  analysis  and  the  general  concept  of 
CFL-reachability  sometime  in  the  fall  of  1994.  (Of  the  papers  mentioned  above,  only  [37]  and  [39]  mention 
CFL-reachability  explicitly  and  reference  Yannakakis’s  paper  [51].)  The  constructions  of  the  present  paper  for 
converting  set-constraint  problems  to  CFL-reachability  problems — together  with  the  fact  that  set  constraints 
have  been  used  for  program  analysis — show  that  CFL-reachability  using  path  languages  other  than  Dyck 
languages  is  also  of  utility  for  program  analysis. 

7.4  Slicing  Higher-Order  Functional  Languages 

Program  slicing  is  an  operation  that  identifies  semantically  meaningful  decompositions  of  programs,  where  the 
decompositions  consist  of  elements  that  are  not  necessarily  textually  contiguous  [50,  34,  23].  CFL-reachability 
has  been  applied  to  the  problem  of  slicing  programs  written  in  imperative  Algol-like  languages  [23].  Regular- 
tree  grammars  have  been  applied  to  the  problem  of  slicing  programs  written  in  a  first-order  functional 
language  (that  manipulates  heap- allocated  data  structures)  [42]. 

We  now  sketch  how  the  technique  developed  in  the  construction  given  in  Section  5  allows  CFL-reachability 
to  be  applied  to  the  problem  of  slicing  programs  written  in  a  higher-order  functional  language  (again  that 
manipulates  heap-allocated  data  structures).  The  latter  problem  has  not  been  previously  addressed  in  the 
literature  on  program  slicing. 

Specifically, the slicing algorithm will be formulated for a higher-order LISP-like functional language that
has the constructor and selector operations NIL, CONS, CAR, and CDR for manipulating heap-allocated
data (i.e., lists and dotted pairs), together with appropriate predicates (EQUAL, ATOM, and NULL), but
no operations for destructive updating (e.g., RPLACA and RPLACD). The constructs of the language are

xi         (ATOM e1)   (CONS e1 e2)    (OP op e1 e2)
'c         (NULL e1)   (IF e1 e2 e3)   (DEFINE (f x1 ... xk) ef)
(CAR e1)   (CDR e1)    (EQUAL e1 e2)   (CALL f e1 ... ek)

A  program  is  a  list  of  function  definitions,  with  a  distinguished  top-level  goal  function,  named  main. 
We  assume  that  the  distinguished  atom  “NIL”  is  used  for  terminating  lists,  and  that  there  is  also  a  special 
empty-tree  value  (different  from  NIL)  denoted  by  “?” . 

Following Reps and Turnidge, we consider the problem of slicing a functional program P(x) in terms of
symbolically composing P(x) with an appropriate projection function π(y) [42]. Projection function π(y)
characterizes what information should be retained and what information should be discarded from the value
that P(x) computes. We consider projection functions that can be represented as a regular language of access
paths, where an access path represents a sequence of CAR and CDR operations. We require that the set
of access paths defined by projection function π(y) be prefix-closed. (In order to access a part of P(x)’s
return value along an access path p, it is also necessary to access every part of P(x)’s return value that is
reached along a prefix of p; requiring that π(y) be prefix-closed is not strictly necessary, but it simplifies the
presentation below.)




[Table 2 graphic: for each subexpression form, the corresponding value-flow-graph fragment. The forms
covered are x_i, ’c, (CAR e_1), (CDR e_1), (CONS e_1 e_2), (IF e_1 e_2 e_3), (ATOM e_1), (EQUAL e_1 e_2),
(NULL e_1), (OP op e_1 e_2), (CALL f e_1 ... e_k), and (DEFINE (f x_1 ... x_k) e_f). Edge labels used in the
fragments include Id, cons_1, cons_2, cons_1^{-1}, cons_2^{-1}, ctrlOrAtomicUse, input_1, ..., input_k, return,
and return^{-1}. Legend: ● expression node; ○ incomplete expression.]
Table 2: Summary of the construction of a value-flow graph from each subexpression of a program. Reverse
edges (e.g., Rev_Id) are not shown. Each occurrence of a variable x generates a new node in the value-flow
graph. In addition to the edges shown above, there is an Id edge from each function parameter x_i to each
use of x_i and a Rev_Id edge from the use to the parameter. There is also an Id edge from each function
definition to a use of the function and a Rev_Id edge from each function use to the function definition. See
Figures 17 and 19 for complete examples of value-flow graphs.

π(P(x))’s return value is a pruned copy of P(x)’s return value in which every substructure that cannot
be reached by an access path in π(y) has been replaced by “?”. The slicing problem becomes one of
understanding what parts of P(x) affect the return value of π(P(x)). The slicing algorithm should therefore
identify the subexpressions of P(x) that could not affect a portion of P(x)’s return value that will be accessed
by π(y) (via an access path in π(y)), and replace these subexpressions by ’?. As long as the client of the
sliced program abides by the access “contract” given by π(y), the values that can be inspected will be the
same as those generated by P(x).

We define a graph, called a value-flow graph, whose nodes represent the subexpressions of P(x) and
whose edges represent dependences among subexpressions, the passing of parameters and return values, etc.
Table 2 summarizes the construction of the value-flow graph from the subexpressions of P(x). With the
exception of ctrlOrAtomicUse edges, the edges in a value-flow graph are similar in function to the analogous
edges in Section 5.2.1. An edge ctrlOrAtomicUse(v, w) indicates one of the following facts: (i) the expression
w makes a control use of the values returned by v (i.e., either w calls a function value returned by v, or w
makes a branch decision based on a boolean value returned by v); (ii) the expression w makes an atomic use
of the values returned by v (i.e., w uses the values returned by v but never performs a CAR or a CDR on
those values). For purposes of slicing, an edge ctrlOrAtomicUse(v, w) indicates that if the values returned
by w can affect π(P(x)), then the values returned by v can affect π(P(x)), although the CAR and the CDR
of values returned by v cannot affect π(P(x)).

Figure 16: Example projection graphs: (a) shows the projection graph for the identity function; (b) shows
the projection graph for the projection function that accesses everything in the CAR of the return value and
discards everything in the CDR; and (c) shows the projection graph for the projection function that accesses
every odd element of a list and discards every even element.

In addition to the value-flow graph, we define a projection graph that represents the deterministic finite
automaton (DFA) that accepts the language of access paths defined by π(y). The projection graph has a
unique start node, one or more accepting nodes, and at most one rejecting node. Each transition in the DFA
is represented in the projection graph by a cons_1^{-1} edge (representing a CAR operation) or a cons_2^{-1} edge
(representing a CDR operation). Figure 16 shows some example projection graphs.
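
The effect of composing a value with a projection function can be illustrated with a small sketch. The Python below is not part of the construction in this paper: it encodes a projection DFA as a dictionary, represents a CONS cell as a 2-tuple and an atom as a string, and returns a pruned copy of a value in which every substructure reached in a non-accepting state is replaced by "?". The names project, CAR_ONLY, and ODD_ELEMENTS, and the exact state layout of the two automata, are assumptions chosen to mimic the projections described for Figures 16(b) and 16(c).

```python
# A DFA is (transitions, accepting): transitions[state] maps "car"/"cdr" to the
# next state; any missing transition leads to the implicit rejecting state.
REJECT = "reject"

def project(value, dfa, state="start"):
    """Return a pruned copy of value; parts reached in a rejecting state become '?'."""
    transitions, accepting = dfa
    if state not in accepting:
        return "?"
    if isinstance(value, tuple):          # a CONS cell: (car, cdr)
        step = transitions.get(state, {})
        car, cdr = value
        return (project(car, dfa, step.get("car", REJECT)),
                project(cdr, dfa, step.get("cdr", REJECT)))
    return value                          # an atom is kept as-is

# Hypothetical automata (both languages of access paths are prefix-closed).
ALL = {"car": "all", "cdr": "all"}
CAR_ONLY = ({"start": {"car": "all"}, "all": ALL}, {"start", "all"})
ODD_ELEMENTS = ({"start": {"car": "all", "cdr": "even"},
                 "even": {"cdr": "start"},
                 "all": ALL},
                {"start", "even", "all"})

pair = (("x", "y"), "z")
lst = ("a", ("b", ("c", ("d", "NIL"))))
print(project(pair, CAR_ONLY))
print(project(lst, ODD_ELEMENTS))
```

In the second call the list spine stays accessible (the "even" state is accepting) while the even-numbered elements are pruned, which is what prefix-closure requires.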

To apply CFL-reachability to the slicing problem, a composite graph is created by connecting the value-flow
graph and the projection graph with the edges return^{-1}(main, start) and ctrlOrAtomicUse(main, start),
where main is the node in the value-flow graph that represents the definition of main and start is the start
node of the projection graph. (The edge return^{-1}(main, start) indicates that we are interested in the values
returned by main. The edge ctrlOrAtomicUse(main, start) indicates that any execution of the program
makes a control use of the function main.) We define a language Slice, such that a Slice-path from a node v
in the value-flow graph to an accepting node in the projection graph indicates that the value computed by
subexpression v may affect the value returned by π(P(x)):


    Id                 ::=  Id Id
                        |   cons_1 Id cons_1^{-1}
                        |   cons_2 Id cons_2^{-1}
                        |   input_i Rev_Id input_i^{-1}    (for 1 ≤ i ≤ maximum number of function parameters)
                        |   return Id return^{-1}
                        |   ε

    UnbalRight         ::=  UnbalRight cons_1^{-1} Id
                        |   UnbalRight cons_2^{-1} Id
                        |   Id

    CtrlOrAtomicSlice  ::=  UnbalRight ctrlOrAtomicUse
                        |   CtrlOrAtomicSlice UnbalRight ctrlOrAtomicUse

    Slice              ::=  UnbalRight
                        |   CtrlOrAtomicSlice UnbalRight


Issues of groundness are ignored in this grammar. Furthermore, the productions for Rev_Id (which correspond
exactly to the productions for Id but in the “reverse” direction) have not been shown (see Section 4.1.2
for a discussion of reversing productions).

In this grammar, the nonterminal UnbalRight represents an unbalanced right path. An unbalanced right
path includes an excess of selection operators; an edge UnbalRight(v, w) indicates that the values returned
by the expression w may include substructures of the values returned by the expression v. The nonterminal
CtrlOrAtomicSlice represents a control or atomic slice path. An edge CtrlOrAtomicSlice(v, w) indicates that
the values returned by the expression w are affected (e.g., via a control dependence) by a substructure of
the values returned by the expression v.

Figure 17: Value-flow graph and projection graph for Example 7.2. Reverse edges are not shown.

Slicing is performed by determining all sub-terms w for which there is no Slice-path from the node
that represents w to an accepting node and replacing them by ’?; this is done with one exception: formal
parameters are never replaced with ’?.

Example  7.2  Consider  the  following  program: 

(DEFINE  (main  y)  (CALL  Swap  y)) 

(DEFINE  (Swap  x)  (CONS  (CDR  x)  (CAR  x))) 


Suppose  that  we  are  only  interested  in  the  CAR  of  the  value  returned  by  this  program.  Figure  17  shows  the 
value-flow  graph  for  this  program  together  with  the  projection  graph  for  the  CAR  projection  function. 

The results of running the CFL-Reachability Algorithm on the graph in Figure 17 indicate that there
is no Slice-path from the node that represents the expression (CAR x) to an accepting node. There are
Slice-paths from all other expression nodes. For example, from the expression (CDR x), there is a path to
an accepting node that spells out the string

    cons_1 return Id return^{-1} return return^{-1} cons_1^{-1}.

This string can be derived from the nonterminal Slice as shown in Figure 18.

Figure 18: Derivation tree for the string cons_1 return Id return^{-1} return return^{-1} cons_1^{-1}.
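
The derivability claim can be checked mechanically. The following Python is a minimal sketch, not the CFL-Reachability Algorithm of Section 2.1.1: it lays the string out as a path graph and applies the Slice productions by naive fixpoint iteration. The function names and the ASCII edge labels (cons1_inv for cons_1^{-1}, and so on) are assumptions made for the example, and the input_i and Rev_Id productions are omitted because this string does not use them. The second string, which starts with cons_2 as a path leaving the second argument of a CONS would, is a hypothetical counterexample rather than a path read off Figure 17.

```python
def cfl_closure(nodes, edges, productions):
    """Naive fixpoint: add lhs(u, v) whenever some rhs spells a path from u to v."""
    facts = set(edges)                      # facts are (label, source, target)
    changed = True
    while changed:
        changed = False
        succ = {}
        for label, u, v in facts:
            succ.setdefault((label, u), set()).add(v)
        for lhs, rhs in productions:
            for u in nodes:
                frontier = {u}              # an empty rhs (epsilon) leaves {u}
                for symbol in rhs:
                    frontier = {w for v in frontier
                                for w in succ.get((symbol, v), ())}
                for v in frontier:
                    if (lhs, u, v) not in facts:
                        facts.add((lhs, u, v))
                        changed = True
    return facts

PRODUCTIONS = [
    ("Id", ("Id", "Id")),
    ("Id", ("cons1", "Id", "cons1_inv")),
    ("Id", ("cons2", "Id", "cons2_inv")),
    ("Id", ("return", "Id", "return_inv")),
    ("Id", ()),
    ("UnbalRight", ("UnbalRight", "cons1_inv", "Id")),
    ("UnbalRight", ("UnbalRight", "cons2_inv", "Id")),
    ("UnbalRight", ("Id",)),
    ("CtrlOrAtomicSlice", ("UnbalRight", "ctrlOrAtomicUse")),
    ("CtrlOrAtomicSlice", ("CtrlOrAtomicSlice", "UnbalRight", "ctrlOrAtomicUse")),
    ("Slice", ("UnbalRight",)),
    ("Slice", ("CtrlOrAtomicSlice", "UnbalRight")),
]

def spells_slice(labels):
    """True if the label sequence, read as a path 0 -> len(labels), is in Slice."""
    nodes = range(len(labels) + 1)
    edges = {(label, i, i + 1) for i, label in enumerate(labels)}
    return ("Slice", 0, len(labels)) in cfl_closure(nodes, edges, PRODUCTIONS)

cdr_path = ["cons1", "return", "Id", "return_inv",
            "return", "return_inv", "cons1_inv"]
car_path = ["cons2"] + cdr_path[1:]         # hypothetical mismatched string
print(spells_slice(cdr_path), spells_slice(car_path))
```

Because Id labels both original edges and derived edges, the terminal Id in the string and the nonterminal Id share one label in the sketch, so no separate production is needed to relate them.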

The  sliced  version  of  the  above  program  is 

(DEFINE  (main  y)  (CALL  Swap  y)) 

(DEFINE  (Swap  x)  (CONS  (CDR  x)  ’?))


□ 


Example  7.3  To  illustrate  slicing  of  a  higher-order  function,  consider  the  following  program: 

(DEFINE  (main  y)  (CALL  Swap  y  MyCons)) 

(DEFINE  (Swap  x  pairfn)  (CALL  pairfn  (CDR  x)  (CAR  x))) 

(DEFINE  (MyCons  z  w)  (CONS  z  w)) 


This program is very similar to the program in Example 7.2 except that the function MyCons is passed
as a parameter to the function Swap. As in the previous example, suppose we are interested in the CAR
of the value returned by this program. Figure 19 shows the value-flow graph together with the projection
graph. The results of running the CFL-Reachability Algorithm on the graph in Figure 19 indicate that there
are no Slice-paths from the expression (CAR x) nor from the second argument of the function MyCons.
There are Slice-paths from all other expressions. For example, there is a path from the expression (CDR x) to
an accepting node that spells out the string


    input_1 Rev_Id rev_input_2^{-1} Id rev_input_2 Rev_Id input_1^{-1} Id cons_1 return Id input_2 Rev_Id input_2^{-1} Id return^{-1}
    return Id return^{-1} return return^{-1} cons_1^{-1}.                                                                    (1)


This  string  can  be  derived  from  the  nonterminal  Slice. 

We observe that the Slice-path from (CDR x) to the accepting node contains a Rev_Id-path from the
variable pairfn to the function MyCons and an Id-path from MyCons to pairfn. These paths mean that pairfn
can take on the value MyCons. The Rev_Id-path spells out the string Rev_Id rev_input_2^{-1} Id rev_input_2 Rev_Id
and the Id-path spells out the string Id input_2 Rev_Id input_2^{-1} Id; both of these strings are substrings of (1).

The  sliced  version  of  the  above  program  is 


(DEFINE  (main  y)  (CALL  Swap  y  MyCons)) 
(DEFINE  (Swap  x  pairfn)  (CALL  pairfn  (CDR  x)  ’?)) 
(DEFINE  (MyCons  z  w)  (CONS  z  ’?)) 


□ 


Figure 19: Value-flow graph and projection graph for Example 7.3. Reverse edges are not shown.

The method described above yields executable slices. We now briefly discuss the relationship between
the semantics of a slice and the semantics of the original program. Let Q(x) be the program that results
from slicing P(x) with projection π(y). There are two important points:

• In a call-by-value language, it is possible that Q(x) may terminate on inputs for which π(P(x)) diverges.
Slicing can never introduce divergence; it can only introduce termination, which, from a pragmatic
standpoint, is quite reasonable. If π(P(x)) does terminate, then π(Q(x)) = π(P(x)).

• It is possible that Q(x) ≠ π(P(x)). In particular, Q(x) may contain additional material that is not in
π(P(x)). The reason that such extra information may exist is that slicing is a monovariant analysis.
Because different portions of the result of a function may be needed at different call sites, a function in
a slice may return more information than is needed at a specific call site. In addition, more information
may be present in a variable than is needed at all uses of that variable. For these reasons, a sliced
program may return more information than is actually needed. However, the information returned by
a sliced program is safe with respect to π(y). In particular, π(Q(x)) = π(P(x)).

[42] contains a more detailed discussion of the semantic relationship between a slice and its original program.

7.5  Connection  to  DATALOG

It  is  also  interesting  to  note  another  fact  about  CFL-reachability  problems:  every  CFL-reachability  problem 
can  be  stated  as  a  chain  program  in  DATALOG  [51];  edges  are  represented  as  facts,  and  productions  are 
encoded  as  Horn  clauses.  In  fact,  the  CFL-reachability  Algorithm  presented  in  Section  2.1.1  in  effect  emulates 
semi-naive  bottom-up  evaluation  of  the  equivalent  DATALOG  program.  This  suggests  that  the  class  of 
DATALOG  programs  that  run  in  cubic  time  may  be  useful  for  program  analysis  (see  also  [36,  5]).  The 
construction described in Section 4 also implies that the class of set constraints studied in this paper may
be solved by converting them to equivalent DATALOG programs. In fact, many parts of the
set-constraint-to-CFL-reachability-problem constructions are more easily expressed in DATALOG. In particular,
the addition of reverse edges and the tracking of ground information are easy to express. The resulting
DATALOG program would not necessarily be a chain program, but it would still run in cubic time.
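
To make the correspondence concrete, the sketch below reads a normalized grammar as Horn clauses over edge facts and evaluates them with a worklist, so that each fact is joined with the existing facts once, when it is first derived. This is in the spirit of semi-naive bottom-up evaluation; it is an illustration written for this section, not the algorithm of Section 2.1.1 verbatim. The function name solve, its parameters, and the balanced-parentheses grammar are assumptions made for the example.

```python
from collections import deque

def solve(nodes, edges, epsilon, unary, binary):
    """Worklist evaluation of a normalized grammar, read as Horn clauses.

    epsilon: set of A          encodes  A(u,u).
    unary:   set of (A, B)     encodes  A(u,v) :- B(u,v).
    binary:  set of (A, B, C)  encodes  A(u,w) :- B(u,v), C(v,w).
    """
    facts, work = set(), deque()
    out_idx, in_idx = {}, {}   # (label, source) -> targets; (label, target) -> sources

    def add(fact):
        if fact not in facts:
            label, u, v = fact
            facts.add(fact)
            out_idx.setdefault((label, u), set()).add(v)
            in_idx.setdefault((label, v), set()).add(u)
            work.append(fact)  # every fact is queued exactly once

    for fact in edges:
        add(fact)
    for a in epsilon:
        for n in nodes:
            add((a, n, n))
    while work:
        label, u, v = work.popleft()
        for a, b in unary:
            if b == label:
                add((a, u, v))
        for a, b, c in binary:
            if b == label:     # the new fact is the left premise
                for w in list(out_idx.get((c, v), ())):
                    add((a, u, w))
            if c == label:     # the new fact is the right premise
                for t in list(in_idx.get((b, u), ())):
                    add((a, t, v))
    return facts

# Balanced parentheses: S ::= epsilon | S S | open S close,
# normalized with the helper nonterminal M ::= open S.
edges = {("open", 0, 1), ("open", 1, 2), ("close", 2, 3), ("close", 3, 4)}
facts = solve(range(5), edges, {"S"}, set(),
              {("S", "S", "S"), ("M", "open", "S"), ("S", "M", "close")})
print(("S", 0, 4) in facts, ("S", 0, 3) in facts)
```

The index sets are copied with list(...) before iteration because add may extend the very set being traversed.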

7.6  Demand  Analysis 

An  exhaustive  program-analysis  algorithm  associates  with  each  point  in  a  program  a  set  of  “facts”  that 
characterize  (in  some  fashion)  the  execution  state  that  holds  whenever  that  point  is  reached  during  execution. 
By  contrast,  a  demand  program-analysis  algorithm  computes  a  partial  solution  to  a  problem,  when  only 
part  of  the  full  answer  is  needed  —  e.g.,  whether  a  particular  fact  (or  set  of  facts)  holds  at  a  single  specific 
point  [6,  52,  36,  12,  37,  24,  44].  Demand  analysis  can  sometimes  be  preferable  to  exhaustive  analysis  for  the 
following  reasons: 

• Narrowing the focus to specific points of interest. In program optimization, most of the gains are
obtained from making improvements at a program’s “hot spots”, such as the innermost loops, which
means that information obtained from program analysis is really only needed for selected locations in
the program. Thus, the use of a demand algorithm has the potential to reduce greatly the amount of
extraneous information computed. Similarly, software-engineering tools that analyze programs often
require information only at a certain set of program points. Because it is unlikely that a programmer
will ask questions about all program points, solving just the user’s sequence of demands is likely to be
significantly less costly than performing an exhaustive analysis.

• Narrowing the focus to specific facts of interest. Even when information is desired for every program
point p, the full set of facts at p may not be required. For example, in a closure-analysis problem, we
may be interested in determining which abstractions reach a certain specific application, rather than
determining that information for all applications.

• Sidestepping incremental-updating problems. A transformation performed at one point in the program
can affect the validity of program-analysis information for other points in the program: In many
cases, the old information at such points is no longer safe; the information needs to be updated before
it is possible to perform further transformations at such points. An incremental updating algorithm
could be used to maintain complete information at all program points; however, updating all invalidated
information can be expensive. An alternative is to demand only the information needed to
validate a proposed transformation; each demand would be solved using the current program, thereby
ensuring that the answer is up-to-date.

Of  course,  determining  whether  a  given  fact  holds  at  a  given  point  may  require  determining  whether  other, 
related  facts  hold  at  other  points  (and  those  other  facts  may  not  be  “facts  of  interest”  in  the  sense  of  the 
second  bullet-point  above).  It  is  desirable,  therefore,  for  a  demand-driven  program-analysis  algorithm  to 
minimize  the  amount  of  such  auxiliary  information  computed. 

For program-analysis problems that have been transformed into CFL-reachability problems, demand
algorithms are obtained for free, typically by solving a single-target or multi-target CFL-reachability problem
[24]. Because an algorithm for solving single-target (or multi-target) CFL-reachability problems focuses
on the nodes that reach the specific target(s), it minimizes the amount of extraneous information computed.
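
One simple way to see why focusing on the nodes that reach the target helps is the pre-filter sketched below. It is an illustration under stated assumptions, not the demand algorithm of [24]: any path to a target, whatever its labels, uses only nodes from which the target is reachable in the ordinary sense, so all other edges can be discarded before a CFL-reachability solver is run. The function names are hypothetical.

```python
def nodes_reaching(edges, targets):
    """Backward search: every node with some path (of any labels) to a target."""
    preds = {}
    for _label, u, v in edges:
        preds.setdefault(v, set()).add(u)
    seen, stack = set(targets), list(targets)
    while stack:
        v = stack.pop()
        for u in preds.get(v, ()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def relevant_subgraph(edges, targets):
    """Keep only the edges that can lie on a path ending at one of the targets."""
    keep = nodes_reaching(edges, targets)
    # If the head of an edge reaches a target, so does its tail.
    return {(label, u, v) for label, u, v in edges if v in keep}

edges = {("a", 0, 1), ("b", 1, 2), ("a", 3, 4), ("b", 1, 5)}
print(sorted(relevant_subgraph(edges, {2})))
```

For the target node 2, the edges into nodes 4 and 5 are dropped, since neither node can reach the target.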

The construction described in Sections 4.1 and 4.2 shows that set-constraint problems can also be solved
in a demand-driven fashion: apply the construction to convert the system of set constraints to a
CFL-reachability problem; convert each query to an appropriate single-target (or single-source) CFL-reachability
query, and solve accordingly; finally, convert the answer back to the form that would be expected from
solving a set-constraint problem.

It is likely that demand algorithms could be designed that operate on the set constraints directly; however,
to our knowledge, this has not been investigated before in the literature on set constraints.

A  Correctness  of  the  CFL-Reachability  to  Set-Constraint  Construction

Lemma A.1 Let C be a collection of set constraints containing the constraint V ⊇ ae_1, where ae_1 is an
atomic expression that does not appear in any other constraint. Let C' be C unioned with the collection of
set constraints generated by running the SC-Reduction Algorithm on C. Then for any atomic expression ae_2
that is ground in C', if C' contains the constraints V ⊇ ae_2 and U ⊇ ae_1, then C' also contains U ⊇ ae_2.

Proof: The SC-Reduction Algorithm generates a constraint of the form W ⊇ ae iff it is given constraints
of the form W ⊇ W' and W' ⊇ ae and ae is ground. Thus, if U ≠ V, then the SC-Reduction Algorithm
generates the constraint U ⊇ ae_1 iff ae_1 is ground and the following collection of constraints are present:

    U ⊇ W_1
    W_1 ⊇ W_2
    ...
    W_n ⊇ V

This implies that such a collection of constraints must appear in C' if U ⊇ ae_1 is in C'. It follows that if C'
also contains the constraint V ⊇ ae_2 where ae_2 is ground, then the SC-Reduction Algorithm must also have
generated the constraint U ⊇ ae_2. □

Lemma 3.2 Let C be the collection of set constraints constructed to represent the context-free reachability
problem V. Let G be the graph that results from running the CFL-reachability Algorithm on V. Let C' be C
unioned with the collection of set constraints generated by running the SC-Reduction Algorithm on C. Then
there is an edge A(i,j) in G if and only if C' contains X_i ⊇ A(X_j) and/or Dst[A,i] ⊇ node_j.

Proof of the ⇒ direction: First, we dispense with a technical detail that is the same in all parts of the
proof. In many subcases, we will be able to show that C' contains constraints of the form U ⊇ c_1^{-1}(W)
and W ⊇ c(Y) and need to argue that C' contains U ⊇ Y. In all the cases that arise in the proof, we can
show that C' must contain a constraint of the form Y ⊇ node_j. This will follow either from the original
construction of C (if Y is one of the variables X_j) or from the suppositions in effect at that point of the proof
(if Y is of the form Dst[C,k]). In either case, the groundness of Y will be assured. To avoid clutter in the
following discussion, we will not mention the groundness properties explicitly when we perform reductions.

Assume, on the contrary, that there is an edge A(i,j) in G such that C' contains neither X_i ⊇ A(X_j)
nor Dst[A,i] ⊇ node_j. Note that for each edge B(u,v) in the original graph of the context-free reachability
problem, C (and hence C') contains the constraint X_u ⊇ B(X_v). Thus A(i,j) must have been generated by
the CFL-reachability Algorithm.

Without loss of generality, let A(i,j) be the first edge that the CFL-reachability Algorithm generates such
that C' contains neither X_i ⊇ A(X_j) nor Dst[A,i] ⊇ node_j. There are three reasons that the CFL-reachability
Algorithm might have introduced the edge A(i,j):

case 1: The context-free grammar contains the production A ::= ε. In this case i = j. However, for each
production of the form A ::= ε, for each node k, C (and hence C') contains the constraint X_k ⊇ A(X_k).
Thus in this case, C' must contain the constraint X_i ⊇ A(X_i).

case 2: The context-free grammar contains the production A ::= B, and the edge B(i,j) is present. Since
B(i,j) must be present before A(i,j), and A(i,j) is the first edge generated by the CFL-reachability
Algorithm such that C' contains neither X_i ⊇ A(X_j) nor Dst[A,i] ⊇ node_j, we conclude that C' must
contain X_i ⊇ B(X_j) and/or Dst[B,i] ⊇ node_j. The construction also guarantees that C' contains
the constraints X_i ⊇ A(Dst[A,i]) and Dst[A,i] ⊇ B_1^{-1}(X_i) (to encode the production A ::= B) and the
constraint X_j ⊇ node_j (to encode node j).

The constraints Dst[A,i] ⊇ B_1^{-1}(X_i) and X_i ⊇ B(X_j) combine to give the constraint Dst[A,i] ⊇ X_j.
The constraints Dst[A,i] ⊇ X_j and X_j ⊇ node_j reduce to the constraint Dst[A,i] ⊇ node_j. Thus, if C'
contains X_i ⊇ B(X_j), it must also contain Dst[A,i] ⊇ node_j.

If C' contains Dst[B,i] ⊇ node_j, then it must also contain the constraint X_i ⊇ B(Dst[B,i]) (because
the variable Dst[B,i] is introduced iff this constraint is introduced). The constraints Dst[A,i] ⊇ B_1^{-1}(X_i)
and X_i ⊇ B(Dst[B,i]) combine to give Dst[A,i] ⊇ Dst[B,i], which, together with Dst[B,i] ⊇ node_j, gives
Dst[A,i] ⊇ node_j. Thus, if C' contains Dst[B,i] ⊇ node_j, it must also contain Dst[A,i] ⊇ node_j.

In either case C' must contain the constraint Dst[A,i] ⊇ node_j.
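
The first chain of reductions in case 2 can be replayed with a small sketch. The Python below is a simplification written for this appendix, not the SC-Reduction Algorithm itself: it implements only the two rules that the argument above relies on and, like the discussion, ignores groundness. The tuple encoding of expressions and the name reduce_constraints are assumptions made for the example.

```python
# A constraint U ⊇ e is the pair (U, e), where e is one of
#   ("var", Y)         a set variable
#   ("atom", a)        an atomic expression such as node_j
#   ("con", c, args)   a constructor applied to a tuple of variable names
#   ("proj", c, k, W)  the projection c_k^{-1}(W)

def reduce_constraints(constraints):
    """Close a constraint set under two reduction rules (groundness ignored)."""
    cs = set(constraints)
    changed = True
    while changed:
        new = set()
        for u, e1 in cs:
            for w, e2 in cs:
                # Rule 1: U ⊇ W and W ⊇ ae give U ⊇ ae (ae an atom or constructor term).
                if e1 == ("var", w) and e2[0] in ("atom", "con"):
                    new.add((u, e2))
                # Rule 2: U ⊇ c_k^{-1}(W) and W ⊇ c(Y_1, ..., Y_n) give U ⊇ Y_k.
                if (e1[0] == "proj" and e1[3] == w
                        and e2[0] == "con" and e2[1] == e1[1]):
                    new.add((u, ("var", e2[2][e1[2] - 1])))
        changed = not new <= cs
        cs |= new
    return cs

# The constraints used in case 2 when C' contains X_i ⊇ B(X_j).
case2 = {
    ("Dst[A,i]", ("proj", "B", 1, "X_i")),   # Dst[A,i] ⊇ B_1^{-1}(X_i)
    ("X_i", ("con", "B", ("X_j",))),         # X_i ⊇ B(X_j)
    ("X_j", ("atom", "node_j")),             # X_j ⊇ node_j
}
closed = reduce_constraints(case2)
print(("Dst[A,i]", ("atom", "node_j")) in closed)
```

The closure contains Dst[A,i] ⊇ X_j after the first pass and Dst[A,i] ⊇ node_j after the second, matching the two steps of the argument.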

case 3: The context-free grammar contains the production A ::= B C and the edges B(i,k) and C(k,j) are
present. C' must contain the constraints

    Dst[A,i] ⊇ C_1^{-1}(Rchd[B^{-1},i]) and
    Rchd[B^{-1},i] ⊇ B_1^{-1}(X_i)

to encode the production A ::= B C. Since the edge B(i,k) is present before A(i,j), C' must also
contain X_i ⊇ B(X_k) and/or Dst[B,i] ⊇ node_k. This gives us two subcases:

case 3.a: Suppose C' contains X_i ⊇ B(X_k). This constraint and the constraint Rchd[B^{-1},i] ⊇ B_1^{-1}(X_i)
give the constraint Rchd[B^{-1},i] ⊇ X_k. (Thus, in this case, C' must contain Rchd[B^{-1},i] ⊇ X_k.)

Since the edge C(k,j) was present before edge A(i,j), C' must also contain X_k ⊇ C(X_j) and/or
Dst[C,k] ⊇ node_j. This gives two subcases:

case 3.a.i: Suppose C' contains X_k ⊇ C(X_j). The constraints

    Rchd[B^{-1},i] ⊇ X_k and
    X_k ⊇ C(X_j)

combine to give the constraint Rchd[B^{-1},i] ⊇ C(X_j). The constraints

    Dst[A,i] ⊇ C_1^{-1}(Rchd[B^{-1},i]) and
    Rchd[B^{-1},i] ⊇ C(X_j)

reduce to the constraint Dst[A,i] ⊇ X_j. This constraint combines with X_j ⊇ node_j to give
Dst[A,i] ⊇ node_j. So in this case, C' must contain Dst[A,i] ⊇ node_j.

case 3.a.ii: Suppose C' contains Dst[C,k] ⊇ node_j. The construction introduces a variable of the
form Dst[C,k] iff it also introduces the constraint X_k ⊇ C(Dst[C,k]); thus C' must contain
X_k ⊇ C(Dst[C,k]). Given the constraints

    Rchd[B^{-1},i] ⊇ X_k and
    X_k ⊇ C(Dst[C,k])

the SC-Reduction Algorithm produces the constraint Rchd[B^{-1},i] ⊇ C(Dst[C,k]). The constraints

    Dst[A,i] ⊇ C_1^{-1}(Rchd[B^{-1},i]) and
    Rchd[B^{-1},i] ⊇ C(Dst[C,k])

reduce to Dst[A,i] ⊇ Dst[C,k]. This constraint and Dst[C,k] ⊇ node_j reduce to the constraint
Dst[A,i] ⊇ node_j. Thus, in this case, C' must contain Dst[A,i] ⊇ node_j.

case 3.b: Suppose C' contains Dst[B,i] ⊇ node_k. This implies that C' contains the constraint
X_i ⊇ B(Dst[B,i]) (since the variable Dst[B,i] is introduced iff this constraint is added to C during the
original construction). The constraints

    Rchd[B^{-1},i] ⊇ B_1^{-1}(X_i) and
    X_i ⊇ B(Dst[B,i])

reduce to the constraint Rchd[B^{-1},i] ⊇ Dst[B,i]. The constraints

    Rchd[B^{-1},i] ⊇ Dst[B,i] and
    Dst[B,i] ⊇ node_k

combine to give Rchd[B^{-1},i] ⊇ node_k.

Again, we know that C' must contain X_k ⊇ C(X_j) and/or Dst[C,k] ⊇ node_j:

case 3.b.i: Suppose C' contains X_k ⊇ C(X_j). Since the only occurrence of the atomic expression
node_k in C is in the constraint X_k ⊇ node_k, we can use Lemma A.1 and the presence of
X_k ⊇ C(X_j) and Rchd[B^{-1},i] ⊇ node_k in C' to conclude that Rchd[B^{-1},i] ⊇ C(X_j) is also in
C'. The constraints

    Dst[A,i] ⊇ C_1^{-1}(Rchd[B^{-1},i]) and
    Rchd[B^{-1},i] ⊇ C(X_j)

reduce to the constraint Dst[A,i] ⊇ X_j. This constraint combines with the constraint
X_j ⊇ node_j to give Dst[A,i] ⊇ node_j. Thus in this case, C' must contain Dst[A,i] ⊇ node_j.

case 3.b.ii: Suppose C' contains Dst[C,k] ⊇ node_j. Then C' must also contain X_k ⊇ C(Dst[C,k]).
Again by use of Lemma A.1, and the presence of the constraints X_k ⊇ node_k,
Rchd[B^{-1},i] ⊇ node_k, and X_k ⊇ C(Dst[C,k]), we conclude that C' contains the constraint
Rchd[B^{-1},i] ⊇ C(Dst[C,k]). The constraints

    Dst[A,i] ⊇ C_1^{-1}(Rchd[B^{-1},i]) and
    Rchd[B^{-1},i] ⊇ C(Dst[C,k])

combine to give Dst[A,i] ⊇ Dst[C,k], which combines with Dst[C,k] ⊇ node_j to give
Dst[A,i] ⊇ node_j. Thus in this case, C' must contain Dst[A,i] ⊇ node_j.

For all of the possible cases that may cause the CFL-reachability Algorithm to introduce the edge A(i,j), we
have shown that C' contains X_i ⊇ A(X_j) or Dst[A,i] ⊇ node_j. This contradicts the assumption that A(i,j)
is the first edge introduced by the CFL-reachability Algorithm such that C' contains neither X_i ⊇ A(X_j)
nor Dst[A,i] ⊇ node_j, and implies that there can be no such edge A(i,j). □

Proof of the ⇐ direction: We need to show that the presence of the constraint X_i ⊇ A(X_j) or the
constraint Dst[A,i] ⊇ node_j in C' allows us to assert that the edge A(i,j) appears in G.

The constraints in C (the initial collection of constraints constructed to represent the CFL-reachability
problem) must have one of the following forms:

    Rchd[B^{-1},i] ⊇ B_1^{-1}(X_i)          (Follow B-edges from node i; used to encode A ::= B C)
    Dst[A,i] ⊇ C_1^{-1}(Rchd[B^{-1},i])     (Follow C-edges from those nodes; used to encode A ::= B C)
    X_i ⊇ A(Dst[A,i])                      (Add A-edges to the reached nodes; used to encode A ::= B C and A ::= B)
    Dst[A,i] ⊇ B_1^{-1}(X_i)                (Follow B-edges from X_i; used to encode A ::= B)
    X_i ⊇ node_i                           (Encode X_i as representing node i)
    X_i ⊇ A(X_j)                           (Encode an A edge from i to j)
    X_i ⊇ A(X_i)                           (Used to encode A ::= ε)


Following  the  rules  of  the  SC-Reduction  Algorithm,  the  constraints  in  C  may  give  rise  to  constraints  of 
the  following  additional  forms  (which  may  appear  in  C'): 


Rchd[A-\i]  2  Xj 
Rchd[A-i  A  3  Dst[A,i\ 
Rchd^c- ij4]  3  B(Xj) 
Rchd[c- 3  B(Dst[Bj]) 
Rchd[A- 1  ^  3  nodej 


/Is  L  _v  3  Xj 
Dst[AA  3  Dst[B ,j] 
Dst[B,j]  3  C(Xk) 
Ds\b,j]  2  C(Dst[C,k ]) 
Dst[AA  2  nodej 
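| Selected constraint form and associated assertion | Matching constraint form and associated assertion | Produced constraint and associated assertion |
|---|---|---|
| $Rchd[A^{-1},i] \supseteq A^{-1}(X_i)$; null assertion | $X_i \supseteq A(X_j)$; (encodes $A(i,j)$) | $Rchd[A^{-1},i] \supseteq X_j$; "$A(i,j) \in E_G$" |
| | $X_i \supseteq A(Dst[A,i])$; null assertion | $Rchd[A^{-1},i] \supseteq Dst[A,i]$; null assertion |
| $Dst[A,i] \supseteq B^{-1}(Rchd[C^{-1},i])$; (encodes $A ::= C\ B$) | $Rchd[C^{-1},i] \supseteq B(X_j)$; "$\exists k\,[C(i,k) \in E_G$ and $B(k,j) \in E_G]$" | $Dst[A,i] \supseteq X_j$; "$A(i,j) \in E_G$" |
| | $Rchd[C^{-1},i] \supseteq B(Dst[B,j])$; "$C(i,j) \in E_G$" | $Dst[A,i] \supseteq Dst[B,j]$; "$\forall n\,[B(j,n) \in E_G$ implies $A(i,n) \in E_G]$" |
| $Dst[A,i] \supseteq B^{-1}(X_i)$; (encodes $A ::= B$) | $X_i \supseteq B(X_j)$; (encodes $B(i,j) \in E_G$) | $Dst[A,i] \supseteq X_j$; "$A(i,j) \in E_G$" |
| | $X_i \supseteq B(Dst[B,i])$; null assertion | $Dst[A,i] \supseteq Dst[B,i]$; "$\forall n\,[B(i,n) \in E_G$ implies $A(i,n) \in E_G]$" |
| $Rchd[A^{-1},i] \supseteq X_j$; "$A(i,j) \in E_G$" | $X_j \supseteq node_j$; null assertion | $Rchd[A^{-1},i] \supseteq node_j$; "$A(i,j) \in E_G$" |
| | $X_j \supseteq B(X_k)$; (encodes $B(j,k)$) | $Rchd[A^{-1},i] \supseteq B(X_k)$; "$\exists j\,[A(i,j) \in E_G$ and $B(j,k) \in E_G]$" |
| | $X_j \supseteq B(Dst[B,j])$; null assertion | $Rchd[A^{-1},i] \supseteq B(Dst[B,j])$; "$A(i,j) \in E_G$" |
| $Rchd[A^{-1},i] \supseteq Dst[A,i]$; null assertion | $Dst[A,i] \supseteq B(X_j)$; "$\exists k\,[A(i,k) \in E_G$ and $B(k,j) \in E_G]$" | $Rchd[A^{-1},i] \supseteq B(X_j)$; "$\exists k\,[A(i,k) \in E_G$ and $B(k,j) \in E_G]$" |
| | $Dst[A,i] \supseteq B(Dst[B,j])$; "$A(i,j) \in E_G$" | $Rchd[A^{-1},i] \supseteq B(Dst[B,j])$; "$A(i,j) \in E_G$" |
| $Dst[A,i] \supseteq X_j$; "$A(i,j) \in E_G$" | $X_j \supseteq node_j$; null assertion | $Dst[A,i] \supseteq node_j$; "$A(i,j) \in E_G$" (highlighted) |
| | $X_j \supseteq B(X_k)$; (encodes $B(j,k)$) | $Dst[A,i] \supseteq B(X_k)$; "$\exists j\,[A(i,j) \in E_G$ and $B(j,k) \in E_G]$" |
| | $X_j \supseteq B(Dst[B,j])$; null assertion | $Dst[A,i] \supseteq B(Dst[B,j])$; "$A(i,j) \in E_G$" |
| $Dst[A,i] \supseteq Dst[B,j]$; "$\forall n\,[B(j,n) \in E_G$ implies $A(i,n) \in E_G]$" | $Dst[B,j] \supseteq node_k$; "$B(j,k) \in E_G$" | $Dst[A,i] \supseteq node_k$; "$A(i,k) \in E_G$" (highlighted) |
| | $Dst[B,j] \supseteq C(X_k)$; "$\exists n\,[B(j,n) \in E_G$ and $C(n,k) \in E_G]$" | $Dst[A,i] \supseteq C(X_k)$; "$\exists n\,[A(i,n) \in E_G$ and $C(n,k) \in E_G]$" |
| | $Dst[B,j] \supseteq C(Dst[C,k])$; "$B(j,k) \in E_G$" | $Dst[A,i] \supseteq C(Dst[C,k])$; "$A(i,k) \in E_G$" |
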

Table 3: Summary of the reductions that the SC-Reduction Algorithm may perform on a constructed set-constraint problem. For each line of the table, column 3 shows the constraint that results from reducing the constraints shown in columns 1 and 2. Each constraint is shown with its purpose in the original construction, or with its associated assertion in Lemma 3.2, where $E_G$ denotes the set of edges in graph $G$. The highlighted entries indicate the key result for Lemma 3.2.


Note that a constraint of the form $X_i \supseteq A(X_j)$ cannot be generated by the SC-Reduction Algorithm; this means that if $X_i \supseteq A(X_j)$ appears in $C'$, it must also appear in $C$. Hence, $X_i \supseteq A(X_j)$ either encodes $A(i,j)$, or else $j = i$, and $X_i \supseteq A(X_i)$ encodes a result of the production "$A ::= \epsilon$" by representing the edge $A(i,i)$. In either case, $G$ contains the edge $A(i,j)$.

It remains for us to show that if $C'$ contains a constraint of the form $Dst[A,i] \supseteq node_j$, then $G$ contains the edge $A(i,j)$. To do this, we associate an assertion about the graph $G$ with every constraint generated by the SC-Reduction Algorithm, as shown below (where $E_G$ is the set of edges of the graph $G$):
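
| Constraint form | Associated assertion |
|---|---|
| $Rchd[A^{-1},i] \supseteq X_j$ | "$A(i,j) \in E_G$" |
| $Rchd[A^{-1},i] \supseteq Dst[A,i]$ | Null assertion |
| $Rchd[A^{-1},i] \supseteq B(X_j)$ | "$\exists k\,[A(i,k) \in E_G$ and $B(k,j) \in E_G]$" |
| $Rchd[A^{-1},i] \supseteq B(Dst[B,j])$ | "$A(i,j) \in E_G$" |
| $Rchd[A^{-1},i] \supseteq node_j$ | "$A(i,j) \in E_G$" |
| $Dst[A,i] \supseteq X_j$ | "$A(i,j) \in E_G$" |
| $Dst[A,i] \supseteq Dst[B,j]$ | "$\forall n\,[B(j,n) \in E_G$ implies $A(i,n) \in E_G]$" |
| $Dst[A,i] \supseteq B(X_j)$ | "$\exists k\,[A(i,k) \in E_G$ and $B(k,j) \in E_G]$" |
| $Dst[A,i] \supseteq B(Dst[B,j])$ | "$A(i,j) \in E_G$" |
| $Dst[A,i] \supseteq node_j$ | "$A(i,j) \in E_G$" |
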


Table 3 summarizes the reductions that may take place in a set-constraint problem created by our construction; each constraint is shown with its associated assertion. For every line of Table 3, the assertion associated with a generated constraint $V \supseteq sexp$ (shown in column 3) is supported by the assertions associated with the constraints (shown in columns 1 and 2) that were reduced to $V \supseteq sexp$. Since the (implicit) assertions associated with the constraints in $C$ follow from the original construction, it follows that for each constraint generated by the SC-Reduction Algorithm, the associated assertion is true. In particular, for any constraint of the form $Dst[A,i] \supseteq node_j$ in $C'$, it follows that $G$ contains the edge $A(i,j)$ (see the two highlighted entries in Table 3). □
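
The chain of reductions behind the highlighted entries can be illustrated with a small sketch. It is not the SC-Reduction Algorithm itself: it applies only two simplified rules (propagating an atomic expression through a variable, and resolving a projection against a matching constructor) and omits the groundness check. The tuple encoding, the function name `reduce_constraints`, and the example input (a single edge $B(1,2)$ with the production $A ::= B$, written using the constraint forms that appear in Table 3) are assumptions made for illustration.

```python
def reduce_constraints(constraints):
    """Close a set of constraints under two simplified reduction rules.

    Hypothetical encodings, for illustration only:
      ("var",  V, U)        means V ⊇ U
      ("atom", V, c, args)  means V ⊇ c(args); constants have args == ()
      ("proj", V, c, U)     means V ⊇ c^{-1}(U) for a unary constructor c
    The groundness test of the real algorithm is deliberately omitted.
    """
    result = set(constraints)
    changed = True
    while changed:
        changed = False
        new = set()
        for sel in result:
            for mat in result:
                # Rule 1: V ⊇ U and U ⊇ c(args) give V ⊇ c(args).
                if sel[0] == "var" and mat[0] == "atom" and sel[2] == mat[1]:
                    new.add(("atom", sel[1], mat[2], mat[3]))
                # Rule 2: V ⊇ c^{-1}(U) and U ⊇ c(W) give V ⊇ W.
                if (sel[0] == "proj" and mat[0] == "atom"
                        and sel[3] == mat[1] and sel[2] == mat[2]
                        and len(mat[3]) == 1):
                    new.add(("var", sel[1], mat[3][0]))
        if not new <= result:
            result |= new
            changed = True
    return result


# Assumed input: nodes 1 and 2, one edge B(1,2), and the production A ::= B.
C = {
    ("atom", "X1", "node1", ()),       # X1 ⊇ node1
    ("atom", "X2", "node2", ()),       # X2 ⊇ node2
    ("atom", "X1", "B", ("X2",)),      # X1 ⊇ B(X2), encodes B(1,2)
    ("proj", "Dst[A,1]", "B", "X1"),   # Dst[A,1] ⊇ B^{-1}(X1), encodes A ::= B
}
C_prime = reduce_constraints(C)
print(("atom", "Dst[A,1]", "node2", ()) in C_prime)
```

In this run, rule 2 first yields Dst[A,1] ⊇ X2, and rule 1 then yields Dst[A,1] ⊇ node2, which corresponds to the assertion that $A(1,2)$ is an edge of the graph.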


B Correctness of the Set-Constraint to CFL-Reachability Construction

In  this  section  we  prove  the  lemmas  used  in  Section  4.3.  We  use  the  following  definitions: 

$C$ is a collection of set constraints.

$\mathcal{P}$ is the CFL-reachability problem constructed to represent $C$.

$C'$ is the collection of set constraints that results from running the SC-Reduction Algorithm on $C$ (i.e., $C'$ is $C$ unioned with the constraints generated by the SC-Reduction Algorithm).

$G$ is the graph of the CFL-reachability problem $\mathcal{P}$.

$G'$ is the graph that results from running the CFL-reachability Algorithm on $\mathcal{P}$ (i.e., $G'$ is $G$ augmented with the edges added by the CFL-reachability Algorithm).

To  prove  Lemmas  4.4  and  4.5,  it  is  useful  to  have  the  following  observation: 

Observation B.1 If $G'$ contains the edges $Ground(V_1,V_1), \ldots, Ground(V_r,V_r)$, and $c(V_1, V_2, \ldots, V_r)$ is an atomic expression used in $C$ with index $k$, then $G'$ contains the edge $Ground(\langle k \rangle, \langle k \rangle)$.

This follows from the construction of $\mathcal{P}$. In particular, the CFL-reachability Algorithm will use the production

$Ground ::= edge\langle k \rangle toV_1 \;\; Ground \;\; edgeV_1to\langle k \rangle \;\; \ldots \;\; edge\langle k \rangle toV_r \;\; Ground \;\; edgeV_rto\langle k \rangle$

with the appropriate edges to induce the edge $Ground(\langle k \rangle, \langle k \rangle)$. (See Section 4.1.2 for details about how groundness information is handled in the constructed CFL-reachability problem.)
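
Observation B.1 can be checked on a small instance with $r = 2$. The sketch below is a naive fixed-point closure, not the worklist-based CFL-reachability Algorithm of the paper, and the function name `cfl_closure` is hypothetical. It also assumes that each edge labeled edge⟨k⟩toVi runs from node ⟨k⟩ to node Vi and each edge labeled edgeVito⟨k⟩ runs back from Vi to ⟨k⟩; the construction that fixes these directions is given in Section 4.1.2 and is not reproduced here.

```python
def cfl_closure(edges, productions):
    """Naive CFL-reachability closure (illustrative, not a worklist algorithm).

    edges:       iterable of (label, source, target) triples
    productions: list of (lhs, rhs) pairs, where rhs is a list of labels
    Returns the edge set closed under all productions.
    """
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        nodes = {n for (_, u, v) in edges for n in (u, v)}
        for lhs, rhs in productions:
            # (s, t) in frontier: some path from s to t spells the
            # prefix of rhs that has been read so far.
            frontier = {(n, n) for n in nodes}
            for label in rhs:
                frontier = {(s, v)
                            for (s, t) in frontier
                            for (l, u, v) in edges
                            if l == label and u == t}
            for (s, t) in frontier:
                if (lhs, s, t) not in edges:
                    edges.add((lhs, s, t))
                    changed = True
    return edges


K = "<k>"  # node for the atomic expression c(V1, V2) with index k
G = {
    ("Ground", "V1", "V1"),
    ("Ground", "V2", "V2"),
    ("edge<k>toV1", K, "V1"), ("edgeV1to<k>", "V1", K),  # assumed directions
    ("edge<k>toV2", K, "V2"), ("edgeV2to<k>", "V2", K),  # assumed directions
}
P = [("Ground", ["edge<k>toV1", "Ground", "edgeV1to<k>",
                 "edge<k>toV2", "Ground", "edgeV2to<k>"])]
G_prime = cfl_closure(G, P)
print(("Ground", K, K) in G_prime)
```

Removing either of the edges Ground(V1,V1) or Ground(V2,V2) from the input prevents the production from matching, which mirrors the requirement that every argument of the atomic expression be ground.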

Lemma 4.4 If $C'$ contains the constraint $V \supseteq c(V_1, V_2, \ldots, V_r)$, then $G'$ contains the edge $Id(\langle k \rangle, V)$, where $k$ is the index of $c(V_1, V_2, \ldots, V_r)$.

Proof: To show this, we must simultaneously prove the following:

(a) If $C'$ contains $V_1 \supseteq V_2$, then $G'$ contains $Id(V_2, V_1)$.

(b) If $V$ is ground in $C'$, then $G'$ contains $Ground(V, V)$.




Observe that the construction of $G$ guarantees the following:

• If $C$ contains $V \supseteq c(V_1, V_2, \ldots, V_r)$, then $G$ (and hence $G'$) contains $Id(\langle k \rangle, V)$.

• If $C$ contains $V_1 \supseteq V_2$, then $G$ (and hence $G'$) contains $Id(V_2, V_1)$.

To prove the lemma and goals (a) and (b), we show that the following conditions hold when the SC-Reduction Algorithm is run on $C$:

(1) If the SC-Reduction Algorithm generates the constraint $V \supseteq c(V_1, V_2, \ldots, V_r)$, then $G'$ contains the edge $Id(\langle k \rangle, V)$, where $k$ is the index of $c(V_1, V_2, \ldots, V_r)$.

(2) If the SC-Reduction Algorithm generates the constraint $V_1 \supseteq V_2$, then $G'$ contains the edge $Id(V_2, V_1)$.

(3) If the SC-Reduction Algorithm marks the variable $V$ as ground, then $G'$ contains the edge $Ground(V, V)$.

The lemma follows immediately from condition (1).

Assume, on the contrary, that one or more of conditions (1)-(3) fails. Then there must be some first action taken by the SC-Reduction Algorithm that causes the conditions to fail. There are three cases:

case 1: Suppose the SC-Reduction Algorithm generates the constraint $V \supseteq c(V_1, V_2, \ldots, V_r)$, and $G'$ does not contain the edge $Id(\langle k \rangle, V)$.

The only way for the SC-Reduction Algorithm to generate this constraint is from the constraints

$V \supseteq U$ and
$U \supseteq c(V_1, V_2, \ldots, V_r)$

where $c(V_1, V_2, \ldots, V_r)$ is ground. Since the SC-Reduction Algorithm has established that $c(V_1, V_2, \ldots, V_r)$ is ground, the variables $V_1 \ldots V_r$ must be marked as ground. Since this is the first failure of conditions (1)-(3), $G'$ must contain the edges $Id(\langle k \rangle, U)$ and $Id(U, V)$ and the edges $Ground(V_1, V_1) \ldots Ground(V_r, V_r)$. This allows us to use Observation B.1 to conclude that $G'$ contains the edge $Ground(\langle k \rangle, \langle k \rangle)$. Finally, $G'$ contains the edge $ae(\langle k \rangle, \langle k \rangle)$.

Since the context-free grammar of $\mathcal{P}$ contains the production "$Id ::= Ground \;\; ae \;\; Id \;\; Id$," it follows that $G'$ must contain the edge $Id(\langle k \rangle, V)$, which contradicts our supposition.

case 2: Suppose the SC-Reduction Algorithm generates the constraint $U \supseteq V_i$, and $G'$ does not contain the edge $Id(V_i, U)$.

The only way for the SC-Reduction Algorithm to generate this constraint is from constraints of the form

$U \supseteq c_i^{-1}(V)$ and
$V \supseteq c(V_1, V_2, \ldots, V_r)$

where $c(V_1, V_2, \ldots, V_r)$ is ground. The SC-Reduction Algorithm performs this reduction only if it has already marked the variables $V_1 \ldots V_r$ as ground. (This follows because the SC-Reduction Algorithm adds the constraint $c(V_1, V_2, \ldots, V_r)$ to its worklist only if $V_1 \ldots V_r$ have been marked ground.) Since this is the first failure of conditions (1)-(3), $G'$ must contain the edge $Id(\langle k \rangle, V)$ and the edges

$Ground(V_1, V_1)$, $Ground(V_2, V_2)$, $\ldots$, $Ground(V_r, V_r)$.

By Observation B.1, we conclude that $G'$ also contains the edge $Ground(\langle k \rangle, \langle k \rangle)$. From the construction of $G$, it follows that $G'$ contains the edges $c_i(V_i, \langle k \rangle)$ and $c_i^{-1}(V, U)$, as well.

We also have that the context-free grammar of $\mathcal{P}$ contains the production "$Id ::= c_i \;\; Ground \;\; Id \;\; c_i^{-1}$." Given the above edges and this production, the CFL-reachability Algorithm generates the edge $Id(V_i, U)$, which contradicts our supposition.

case 3: Suppose the SC-Reduction Algorithm marks the variable $V$ ground and $G'$ does not contain the edge $Ground(V, V)$.

There are two reasons why the SC-Reduction Algorithm might mark $V$ as ground, which are covered in the following subcases:

case 3.a: Suppose the SC-Reduction Algorithm marks $V$ as ground because $U$ is marked as ground and the constraint $V \supseteq U$ is present. Since this is the first failure of any of the conditions (1)-(3) above, we have that $G'$ must contain the edges $Ground(U, U)$ and $Id(U, V)$. It follows from the construction of $\mathcal{P}$ that $G'$ also contains the edges $edgeVtoV(V, V)$ and $Rev\_Id(V, U)$. The context-free grammar of $\mathcal{P}$ contains the production

$Ground ::= edgeVtoV \;\; Rev\_Id \;\; Ground \;\; Id \;\; edgeVtoV$

This means that $G'$ must contain the edge $Ground(V, V)$, which is a contradiction.

case 3.b: Suppose the SC-Reduction Algorithm marks the variable $V$ ground because $V_1 \ldots V_r$ are marked ground and the constraint $V \supseteq c(V_1, V_2, \ldots, V_r)$ is present. Since this is the first failure of any of the conditions (1)-(3), $G'$ contains the edge $Id(\langle k \rangle, V)$. We also have that $G'$ contains the edge $Ground(\langle k \rangle, \langle k \rangle)$ (by the argument in case 1 above). By the construction of $\mathcal{P}$, it follows that $G'$ also contains the edges $edgeVtoV(V, V)$ and $Rev\_Id(V, \langle k \rangle)$. This means that the production

$Ground ::= edgeVtoV \;\; Rev\_Id \;\; Ground \;\; Id \;\; edgeVtoV$

causes the CFL-reachability Algorithm to induce the edge $Ground(V, V)$, which is a contradiction.

Thus, there can be no action taken by the SC-Reduction Algorithm that causes conditions (1)-(3) to be violated. □
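
For reference, the edge derivations used in cases 1, 2, and 3.a of the proof can be written as paths in $G'$. The block below only restates the edges and productions named in those cases; the arrow notation is introduced here for illustration and is not taken from the paper. Case 3.b follows the pattern of case 3.a with $\langle k \rangle$ in place of $U$.

```latex
\begin{align*}
\text{case 1:}\quad
  & \langle k \rangle \xrightarrow{Ground} \langle k \rangle
    \xrightarrow{ae} \langle k \rangle
    \xrightarrow{Id} U
    \xrightarrow{Id} V
  && \Longrightarrow\quad Id(\langle k \rangle, V) \\
\text{case 2:}\quad
  & V_i \xrightarrow{c_i} \langle k \rangle
    \xrightarrow{Ground} \langle k \rangle
    \xrightarrow{Id} V
    \xrightarrow{c_i^{-1}} U
  && \Longrightarrow\quad Id(V_i, U) \\
\text{case 3.a:}\quad
  & V \xrightarrow{edgeVtoV} V
    \xrightarrow{Rev\_Id} U
    \xrightarrow{Ground} U
    \xrightarrow{Id} V
    \xrightarrow{edgeVtoV} V
  && \Longrightarrow\quad Ground(V, V)
\end{align*}
```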

Lemma 4.5 If $G'$ contains the edge $Id(\langle k \rangle, V)$, then $C'$ contains the constraint $V \supseteq c(V_1, V_2, \ldots, V_r)$, where $c(V_1, V_2, \ldots, V_r)$ is the atomic expression with index $k$.

Proof: To show this, we need to prove a stronger property, namely that the following four conditions hold:

(1) If $G'$ contains the edge $Id(\langle k \rangle, V)$, then $C'$ contains the constraint $V \supseteq c(V_1, V_2, \ldots, V_r)$, where $c(V_1, V_2, \ldots, V_r)$ has index $k$.

(2) If $G'$ contains the edge $Id(V_i, V_j)$, then $C'$ contains the constraint $V_j \supseteq V_i$.

(3) If $G'$ contains the edge $Ground(V, V)$, then the variable $V$ is ground in $C'$.

(4) If $G'$ contains the edge $Ground(\langle k \rangle, \langle k \rangle)$, then the atomic expression $c(V_1, V_2, \ldots, V_r)$ is ground in $C'$, where $c(V_1, V_2, \ldots, V_r)$ has index $k$.

Note  that  edges  from  G  satisfy  the  above  conditions.  Thus,  if  G'  contains  an  edge  e  such  that  one  or  more 
of  the  above  conditions  is  not  satisfied,  then  e  must  have  been  added  by  the  CFL-reachability  Algorithm. 
Assume,  on  the  contrary,  that  such  an  edge  e  exists  in  G'.  Without  loss  of  generality,  let  e  be  the  first  edge 
generated  by  the  CFL-reachability  Algorithm  that  causes  one  (or  more)  of  the  above  conditions  to  fail. 

case 1: Suppose $e$ has the form $Id(\langle k \rangle, V)$ and condition (1) is violated. The only way the CFL-reachability Algorithm can generate this edge is from the production "$Id ::= Ground \;\; ae \;\; Id \;\; Id$." This implies that the edges $Ground(\langle k \rangle, \langle k \rangle)$, $Id(\langle k \rangle, U)$, and $Id(U, V)$ are present before $e$. Since $e$ is the first failure of conditions (1)-(4), it follows that $C'$ contains the constraints

$U \supseteq c(V_1, V_2, \ldots, V_r)$ and $V \supseteq U$

and $c(V_1, V_2, \ldots, V_r)$ is ground in $C'$. This means that $C'$ must contain $V \supseteq c(V_1, V_2, \ldots, V_r)$, which contradicts our assumption.

case 2: Suppose $e$ has the form $Id(V_i, V_j)$ and condition (2) is violated. To generate this edge, the CFL-reachability Algorithm must use a production of the following form:

$Id ::= c_i \;\; Ground \;\; Id \;\; c_i^{-1}$

This implies that $G'$ must contain the edges

$c_i(V_i, \langle k \rangle)$,
$Ground(\langle k \rangle, \langle k \rangle)$,
$Id(\langle k \rangle, U)$, and
$c_i^{-1}(U, V_j)$,

where $k$ is the index of an atomic expression of the form $c(\ldots V_i \ldots)$. Since this is the first failure of conditions (1)-(4), the edge $Ground(\langle k \rangle, \langle k \rangle)$ implies that $c(\ldots V_i \ldots)$ is ground in $C'$, and the edge $Id(\langle k \rangle, U)$ implies that $C'$ contains the constraint $U \supseteq c(\ldots V_i \ldots)$. The edge $c_i^{-1}(U, V_j)$ encodes the constraint $V_j \supseteq c_i^{-1}(U)$, which must be in $C$. It follows that $C'$ must contain the constraint $V_j \supseteq V_i$, which contradicts our supposition.

case 3: Suppose $e$ has the form $Ground(V, V)$ and condition (3) is violated. To generate this edge, the CFL-reachability Algorithm uses the following production:

$Ground ::= edgeVtoV \;\; Rev\_Id \;\; Ground \;\; Id \;\; edgeVtoV$

It follows that $G'$ must contain either the edges $Ground(U, U)$ and $Id(U, V)$ or the edges $Ground(\langle k \rangle, \langle k \rangle)$ and $Id(\langle k \rangle, V)$. In either case, since this is the first failure of conditions (1)-(4), it follows that $V$ is ground in $C'$, which contradicts our supposition.

case 4: Suppose $e$ has the form $Ground(\langle k \rangle, \langle k \rangle)$ and condition (4) is violated. Let $c(V_1, V_2, \ldots, V_r)$ be the atomic expression with index $k$. The only way for the CFL-reachability Algorithm to generate the edge $Ground(\langle k \rangle, \langle k \rangle)$ is by using the following production:

$Ground ::= edge\langle k \rangle toV_1 \;\; Ground \;\; edgeV_1to\langle k \rangle \;\; \ldots \;\; edge\langle k \rangle toV_r \;\; Ground \;\; edgeV_rto\langle k \rangle$

This implies that the edges $Ground(V_1, V_1) \ldots Ground(V_r, V_r)$ are present before $e$ is generated. Since the introduction of $e$ is the first failure of conditions (1)-(4), this implies that the variables $V_1 \ldots V_r$ are all ground in $C'$. But then $c(V_1, V_2, \ldots, V_r)$ must also be ground in $C'$, which contradicts our supposition.

Thus, the CFL-reachability Algorithm does not generate any edge that causes conditions (1)-(4) to fail. The lemma is the same as condition (1). □

Table 3: Total work performed by the CFL-Reachability Algorithm on a constructed problem. Column 1 shows the forms of the labels used in a constructed problem. Column 2 gives a bound on the number of edges with labels of the form listed in column 1. Column 3 shows productions in which labels from column 1 appear on the right-hand side. Column 4 shows the number of productions of the form in column 3 that will be examined when considering a fixed edge with a label of the form in column 1. Column 5 shows the number of new edges that may be produced in total for all of the productions counted in column 4. The total work performed is bounded by (column 4 + column 5) * column 2.

Table 4: Work performed by the CFL-Reachability Algorithm on a problem constructed from an ML set-constraint problem. (See also Table 3.) Column 1 shows the forms of the labels used in a constructed problem. Column 2 gives a bound on the number of edges with labels of the form listed in column 1. Column 3 shows productions in which labels from column 1 appear on the right-hand side. Column 4 shows the number of productions of the form in column 3 that will be examined when considering a fixed edge with a label of the form in column 1. Column 5 shows the number of new edges that may be produced in total for all of the productions counted in column 4. The total work performed is bounded by (column 4 + column 5) * column 2.


